INTRODUCTION 
TO THE 
THEORY OF STATISTICS 


BY 


ALEXANDER McFARLANE MOOD


The RAND Corporation 
Formerly Professor of Statistics, Iowa State College 


FIRST EDITION
SECOND IMPRESSION


New York Toronto London 
McGRAW-HILL BOOK COMPANY, INC. 
1950 


Copyright, 1950, by the McGraw-Hill Book Company, Inc. Printed in the 
United States of America. All rights reserved. This book, or parts thereof, 
may not be reproduced in any form without permission of the publishers. 


THE MAPLE PRESS COMPANY, YORK, PA. 


To 


HARRIET 


PREFACE 


This book developed from a set of notes which I prepared in 1945. 
At that time there was no modern text available specifically designed 
for beginning students of mathematical statistics. Since then the 
situation has been relieved considerably, and had I known in advance 
what books were in the making it is likely that I should not have 
embarked on this volume. However, it seemed sufficiently different 
from other presentations to give prospective teachers and students a 
useful alternative choice. 

The afore-mentioned notes were used as text material for three years 
at Iowa State College in a course offered to senior and first-year 
graduate students. The only prerequisite for the course was one year 
of calculus, and this requirement indicates the level of the book. (The 
calculus class at Iowa State met four hours per week and included good 
coverage of Taylor series, partial differentiation, and multiple integra- 
tion.) No previous knowledge of statistics is assumed. 

This is a statistics book, not a mathematics book, as any mathe- 
matician will readily see. Little mathematical rigor is to be found in 
the derivations simply because it would be boring and largely a waste 
of time at this level. Of course rigorous thinking is quite essential to 
good statistics, and I have been at some pains to make a show of rigor 
and to instill an appreciation for rigor by pointing out various pitfalls 
of loose arguments. 

While this text is primarily concerned with the theory of statistics, 
full cognizance has been taken of those students who fear that a 
moment may be wasted in mathematical frivolity. All new subjects 
are supplied with a little scenery from practical affairs, and, more 
important, a serious effort has been made in the problems to illustrate 
the variety of ways in which the theory may be applied. 

The problems are an essential part of the book. They range from 
simple numerical examples to theorems needed in subsequent chapters. 
They include important subjects which could easily take precedence 
over material in the text; the relegation of subjects to problems was 
based rather on the feasibility of such a procedure than on the priority 
of the subject. For example, the matter of correlation is dealt with 
almost entirely in the problems. It seemed to me inefficient to cover
multivariate situations twice in detail, i.e., with the regression model 
and with the correlation model. The emphasis in the text proper is on
the more general regression model. 

The author of a textbook is indebted to practically everyone who 
has touched the field, and I here bow to all statisticians. However, in 
giving credit to contributors one must draw the line somewhere, and I 
have simplified matters by drawing it very high; only the most eminent 
contributors are mentioned in the book. 

My greatest personal debt is to S. S. Wilks, who kindled my interest 
in statistics and who was my mentor throughout my term of graduate 
study. Any merits which this book may have must be charged largely 
to his careful lectures and understanding direction of my studies. 

My colleagues at Iowa State College have all contributed much to 
my understanding and general view of statistics. I am particularly 
aware of large debts to G. W. Brown, W. G. Cochran, and G. W. 
Snedecor. Among the many students who thoroughly revised the 
original notes by their excellent comments and suggestions I must men- 
tion H. D. Block, who gave the final manuscript a very careful and 
competent review. Margaret Kirwin and Ruth Burns accurately 
translated my scrawl into beautiful typescript. Bernice Brown and 
Miss Burns carefully proofread the entire set of galleys. 

I am indebted to Catherine Thompson and Maxine Merrington, 
and to E. S. Pearson, editor of Biometrika, for permission to include 
Tables III and V, which are abridged versions of tables published in 
Biometrika. I am also indebted to Professors R. A. Fisher and Frank 
Yates, and to Messrs. Oliver and Boyd, Ltd., Edinburgh, for permission 
to reprint Table IV from their book “Statistical Tables for Use in 
Biological, Agricultural and Medical Research.” 

In the final chapter are some distribution-free tests which were 
developed jointly by G. W. Brown and myself at Iowa State College on 
a project sponsored by the Office of Naval Research. Professor Brown 
has very generously and graciously permitted me to include this material
which should have first appeared in print under his name as well as
mine. The tests referred to are presented in Sections 5, 6, 7, 8, and 9 of 
Chapter 16. 

ALEXANDER McFARLANE MOOD

Santa Monica, Calif. 

January, 1950 


CONTENTS


PREFACE


CHAPTER 1. INTRODUCTION

1.1 Statistics
1.2 The Design of Experiments and Investigations
1.3 Statistical Inference
1.4 The Theory and Practice of Statistics
1.5 The Scope of This Book
1.6 Reference System


CHAPTER 2. PROBABILITY AND COMBINATORIAL METHODS

2.1 Definition of Probability
2.2 Permutations and Combinations
2.3 Stirling's Formula
2.4 Sum and Product Notations
2.5 The Binomial and Multinomial Theorems
2.6 Combinatorial Generating Functions
2.7 Marginal and Conditional Probability
2.8 Two Basic Laws of Probability
2.9 Compound Events
2.10 A Priori and Empirical Probabilities
2.11 Notes and References
2.12 Problems


CHAPTER 3. DISCRETE DISTRIBUTIONS

3.1 Introduction
3.2 Discrete Density Functions
3.3 Multivariate Distribution
3.4 The Binomial Distribution
3.5 The Multinomial Distribution
3.6 The Poisson Distribution
3.7 Other Discrete Distributions
3.8 Problems


CHAPTER 4. DISTRIBUTIONS FOR CONTINUOUS VARIATES

4.1 Continuous Variates
4.2 Probability Functions for Continuous Variates
4.3 Multivariate Distributions
4.4 Cumulative Distributions
4.5 Marginal Distributions
4.6 Conditional Distributions
4.7 Independence
4.8 Problems


CHAPTER 5. EXPECTED VALUES AND MOMENTS

5.1 Expected Values
5.2 Moments
5.3 Moment Generating Functions
5.4 Moments for Multivariate Distributions
5.5 The Moment Problem
5.6 Problems


CHAPTER 6. SPECIAL CONTINUOUS DISTRIBUTIONS

6.1 Uniform Distribution
6.2 The Normal Distribution
6.3 The Gamma Distribution
6.4 The Beta Distribution
6.5 Other Distribution Functions
6.6 Problems


CHAPTER 7. SAMPLING

7.1 Inductive Inference
7.2 Populations and Samples
7.3 Sample Distributions
7.4 Sample Moments
7.5 The Law of Large Numbers
7.6 The Central-limit Theorem
7.7 Normal Approximation to the Binomial Distribution
7.8 Role of the Normal Distribution in Statistics
7.9 Problems


CHAPTER 8. POINT ESTIMATION

8.1 Estimation of Parameters
8.2 Properties of Good Estimators
8.3 Principle of Maximum Likelihood
8.4 Some Maximum-likelihood Estimators
8.5 Properties of Maximum-likelihood Estimators
8.6 Notes and References
8.7 Problems


CHAPTER 9. THE MULTIVARIATE NORMAL DISTRIBUTION

9.1 The Bivariate Normal Distribution
9.2 Matrices and Determinants
9.3 The Bivariate Normal Distribution in Matrix Notation
9.4 The Multivariate Normal Distribution
9.5 Marginal and Conditional Distributions
9.6 The Moment Generating Function
9.7 [title illegible in source]
9.8 Problems


CHAPTER 10. SAMPLING DISTRIBUTIONS

10.1 Distributions of Functions of Random Variables
10.2 Distribution of the Sample Mean for Normal Populations
10.3 The Chi-square Distribution
10.4 Independence of the Sample Mean and Variance for Normal Populations
10.5 The F Distribution
10.6 "Student's" Distribution
10.7 Distribution of Sample Means for Binomial and Poisson Populations
10.8 Large-sample Distribution of Maximum-likelihood Estimators
10.9 Applications of the Large-sample Theory
10.10 Problems


CHAPTER 11. INTERVAL ESTIMATION

11.1 Confidence Intervals
11.2 Confidence Intervals for the Mean of a Normal Distribution
11.3 Confidence Intervals for the Variance of a Normal Distribution
11.4 Confidence Region for Mean and Variance of a Normal Distribution
11.5 A General Method for Obtaining Confidence Intervals
11.6 Confidence Intervals for the Parameter of a Binomial Distribution
11.7 Confidence Intervals for Large Samples
11.8 Confidence Regions for Large Samples
11.9 Problems


CHAPTER 12. TESTS OF HYPOTHESES

12.1 Introduction
12.2 Test of a Hypothesis against a Single Alternative
12.3 Tests for Several Alternative Hypotheses
12.4 Simple and Composite Hypotheses
12.5 The Likelihood-ratio Test and Its Large-sample Distribution
12.6 Tests on the Mean of a Normal Population
12.7 The Difference between Means of Two Normal Populations
12.8 Tests on the Variance of a Normal Distribution
12.9 The Goodness-of-fit Test
12.10 Tests of Independence in Contingency Tables
12.11 Notes and References
12.12 Problems


CHAPTER 13. REGRESSION AND LINEAR HYPOTHESES

13.1 Families of Populations
13.2 Simple Linear Normal Regression
13.3 [title illegible in source]
13.4 Discrimination
13.5 Multiple Regression
13.6 Linear Hypotheses
13.7 Applications of Normal Regression Theory
13.8 The Method of Least Squares
13.9 Notes and References
13.10 Problems


CHAPTER 14. EXPERIMENTAL DESIGNS AND THE ANALYSIS OF VARIANCE

14.1 Experimental Designs
14.2 Analysis of Variance in Regression
14.3 One-factor Experiments
14.4 An Application of Normal Regression Theory
14.5 Two-factor Experiments with One Observation per Cell
14.6 Two-factor Experiments with Several Observations per Cell
14.7 Three-factor Experiments
14.8 Latin and Greco-Latin Squares
14.9 Components-of-variance Models
14.10 Components of Variance for Two-factor and Three-factor Experiments
14.11 [title illegible in source]
14.12 Analysis of Covariance
14.13 Analysis of Adjusted Means
14.14 Notes and References
14.15 Problems


CHAPTER 15. SEQUENTIAL TESTS OF HYPOTHESES

15.1 [title illegible in source]
15.2 Construction of Sequential Tests
15.3 Power Functions
15.4 Average Sample Size
15.5 Sampling Inspection
15.6 Sequential Sampling Inspection
15.7 Sequential Test for the Mean of a Normal Population
15.8 Notes and References
15.9 Problems


CHAPTER 16. [chapter title missing in source]

16.1 Introduction
16.2 A Basic Distribution
16.3 Location and Dispersion
16.4 Comparison of Two Populations
16.5 A Distribution-free Test for One-factor Experiments
16.6 Two-factor Experiments, One Observation per Cell
16.7 Two-factor Experiments, Several Observations per Cell
16.8 Simple Linear Regression
16.9 General Linear Regression
16.10 [word illegible in source] of Association
16.11 Power Functions
16.12 Notes and References
16.13 Problems


TABLES

I. Ordinates of the Normal Density Function
II. Cumulative Normal Distribution
III. Cumulative Chi-square Distribution
IV. Cumulative "Student's" Distribution
V. Cumulative F Distribution

CHAPTER 1
INTRODUCTION 


1.1. Statistics. In order to place this book in its proper perspective, 
it is necessary to consider first what statistics is. The lay conception 
of statistics ordinarily includes the collection of large masses of data 
and the presentation of such data in tables or charts; it may also 
include the calculation of totals, averages, percentages, and the like. 
In any case this conception is about thirty years out of date; these 
more or less routine operations are only an incidental part of statistics 
today. 

We shall describe statistics as the technology of the scientific method. 
Statistics provides tools and techniques for research workers. These 
tools may be of quite general application and useful in any field of 
science—physical, biological, or social. On the other hand certain 
tools may be particularly designed for special fields of research. 

We shall not embark on a discussion of the scientific method here,
but we may recall its three main aspects: (1) the performance of experi- 
ments, (2) the drawing of objective conclusions from experiments, and 
(3) the construction of laws to simplify the description of the conclu- 
sions of large classes of experiments. Statistics is primarily concerned 
with the first two of these aspects; in fact, the field of statistics is com- 
monly thought of as being divided into the two areas corresponding to 
these two aspects: (1) the design of experiments and investigations, (2) 
statistical inference. We shall continue our description of statistics 
by discussing these areas briefly in the following two sections. 

1.2. The Design of Experiments and Investigations. An experi- 
ment is meant to study the effect of variation of certain factors or the 
relation between certain factors. Thus one may wish to study the 
relation between temperature and pressure in a fixed volume of a gas. 
Or one may wish to discover what if any effect on milk production 
results from altering the proportion of roughage in a cow’s diet. 
Again one may wish to study the effect on the retail price of a certain 
commodity when a given public policy regarding the commodity is 
promulgated. 

In the typical experiment the research worker is harassed by additional
factors which influence the outcome of the experiment, factors
which he would like to eliminate but cannot control completely. 
These extraneous factors are least important in the physical sciences,
where the experimenter has good control over his experimental
material. They are quite important in the biological sciences, where
the geneticist must deal with animals each having its own peculiar 
genetic inheritance, the plant breeder must deal with whatever varie- 
ties happen to be available, do his experiments in whatever soil is at 
hand and in whatever weather conditions may occur. The extraneous 
factors become most troublesome in the social sciences, where the 
research worker frequently has no control at all over his experimental 
material. Studies in these sciences are often investigations rather 
than experiments. 

Statistics is concerned with these extraneous factors—with design- 
ing the experiment so as to eliminate them if possible or to minimize 
their effects, with arranging the experiment in space or time so that 
the effects may be expected to cancel or partially cancel themselves, 
with designing the experiment so that the effects may be removed or 
partially removed in the analysis of the resulting data. The design 
may be nothing more than an obvious application of common sense. 
Thus suppose batches of the same material from several different 
sources are to be analyzed in order to determine whether they are 
sufficiently alike to be treated the same way in some manufacturing 
process. A number of specimens chosen at random from each batch 
are to be analyzed; two men are to do the individual analyses. It is 
plain that the specimens from each batch should be divided equally 
between the two analysts, else variations due to differences in the 
analysts’ techniques will appear in the final results as differences 
between batches. Experimental designs range from such trivial 
devices as this to highly elaborate arrangements based on the mathe- 
matical theory of finite geometries. 
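
The balancing of specimens between the two analysts may be sketched in a few lines of Python. This is only an illustration under stated assumptions: the batch labels, the analyst labels, the figure of four specimens per batch, and the function name assign_specimens are all hypothetical and do not come from the text.

```python
import random

def assign_specimens(batches, analysts, seed=0):
    """Divide each batch's specimens equally among the analysts, at random.

    batches:  dict mapping a batch label to a list of specimen labels
    analysts: list of analyst labels
    The fixed seed is only there to make the illustration repeatable.
    """
    rng = random.Random(seed)
    assignment = {analyst: [] for analyst in analysts}
    for batch, specimens in batches.items():
        shuffled = list(specimens)
        rng.shuffle(shuffled)  # random order within the batch
        for i, specimen in enumerate(shuffled):
            # Deal the shuffled specimens out in turn, like cards.
            analyst = analysts[i % len(analysts)]
            assignment[analyst].append((batch, specimen))
    return assignment

# Hypothetical data: three batches of four specimens, two analysts.
batches = {
    "batch_1": [1, 2, 3, 4],
    "batch_2": [1, 2, 3, 4],
    "batch_3": [1, 2, 3, 4],
}
assignment = assign_specimens(batches, ["analyst_A", "analyst_B"])
for analyst, items in assignment.items():
    print(analyst, items)
```

Because every analyst handles the same number of specimens from every batch, a systematic difference between the analysts' techniques affects all batches alike and is not mistaken for a difference between batches.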

In designing investigations, the problem is normally one of balanc- 
ing extraneous factors by selecting representative samples. Thus sup- 
pose a political party, in order to judge how actively it should campaign 
in a given state, employs a public-opinion-polling agency to estimate 
the proportions of voters in the state who intend to vote for its candi- 
date and the rival candidate. The polling agency will do this by 
interviewing a sample of voters in the state. It is clear that the factor 
in which the agency is interested (proportions of voters favoring the 
two candidates) will be widely influenced by a great many other factors 
in which it is not directly interested. For example, farmers as a group
and laborers as a group may feel quite differently about the candi- 
dates. The agency must control this factor by making the proportions 
of people in various occupational groups in the sample equal to those 
proportions in the state. It should make the proportions of people in 
various racial groups in the sample equal those for the state. The 
proportions of people at different economic levels should be the same. 
The proportions of people in different geographical areas should be the 
same. And so on. The sample should, in short, be as representative
as possible of the population of the state. The statistician is concerned 
with ways of selecting such samples or, if this is impossible or imprac- 
ticable, with ways of assessing the magnitudes of the effects of such 
extraneous factors and removing them in the final analysis of the 
results. 
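
Matching the sample proportions to those of the state amounts to what is usually called proportional allocation. The sketch below shows one way the quotas might be computed for a single factor; the occupational counts, the sample size of 1000, and the function name proportional_allocation are hypothetical, and the largest-remainder rounding is merely one convenient choice.

```python
def proportional_allocation(population_counts, sample_size):
    """Give each group a share of the sample proportional to its
    share of the population, rounding by largest remainder so that
    the quotas add up exactly to sample_size."""
    total = sum(population_counts.values())
    exact = {g: sample_size * c / total for g, c in population_counts.items()}
    quotas = {g: int(x) for g, x in exact.items()}  # round down first
    shortfall = sample_size - sum(quotas.values())
    # Hand the remaining interviews to the groups with the largest
    # fractional parts.
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for g in by_remainder[:shortfall]:
        quotas[g] += 1
    return quotas

# Hypothetical numbers of voters by occupational group in the state.
voters = {"farmers": 300_000, "laborers": 500_000, "other": 200_000}
quotas = proportional_allocation(voters, 1000)
print(quotas)
```

In practice the agency would have to balance several factors at once (occupation, race, economic level, geography), which is a good deal harder than this one-factor illustration suggests.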

1.3. Statistical Inference. New knowledge in science is usually 
found by a logically hazardous process—the process of generalizing 
from particular results. The scientist, on perceiving a certain pattern 
in the results of one or more experiments, conjectures that the pattern 
may be characteristic of a large class of possible experiments. The 
conjecture or hypothesis would ordinarily be tested by performing 
other experiments; it might be further supported or it might be dis- 
proved. The latter outcome is by no means infrequent, for generalization
or inductive thinking is well known to lead to uncertain
conclusions. 

The broad problem of statistical inference is to provide measures of 
the uncertainty of conclusions drawn from experimental data. This 
problem is attacked by means of the theory of probability, which 
forms the foundation of the theory of statistical inference. The tools 
of statistical inference enable the scientist to assess the reliability of 
his conclusions in terms of probability statements. To consider a 
simple example: Suppose a chemist has made three precise determina- 
tions of the atomic weight of chlorine, and suppose his results are 
35.4563, 35.4578, 35.4575. He might conclude, for example, that the 
true atomic weight is between 35.456 and 35.458. It is the function 
of statistical inference to tell the chemist to what extent he may rely 
on this conclusion. The measure of reliability might be given by a 
statement of this form: “The odds are two to one that the conclusion 
is correct.” If it is important that the chemist estimate the atomic 
weight within .002, he will likely be dissatisfied with such low odds 
and will make further determinations in order to decrease his chances 
of being wrong. He might, for example, feel that for his purposes he 
must be very confident of his conclusion and repeat his determinations
until there is only one chance in a hundred of his final conclusion being 
in error. 
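
The raw materials for such a probability statement are the mean of the determinations and a measure of their scatter. The following sketch computes them for the three values quoted above, using only the Python standard library. It is not the author's calculation: the odds of two to one in the text are illustrative, and turning the standard error into actual odds requires the sampling theory developed in later chapters (the "Student's" distribution and confidence intervals for the mean of a normal distribution).

```python
import math
import statistics

# The chemist's three determinations of the atomic weight of chlorine.
determinations = [35.4563, 35.4578, 35.4575]

n = len(determinations)
mean = statistics.mean(determinations)
s = statistics.stdev(determinations)   # sample standard deviation, divisor n - 1
standard_error = s / math.sqrt(n)      # estimated standard deviation of the mean

# The interval the chemist proposes in the text.
low, high = 35.456, 35.458

print(f"mean           = {mean:.4f}")
print(f"std. deviation = {s:.6f}")
print(f"standard error = {standard_error:.6f}")
print(f"mean lies in proposed interval: {low <= mean <= high}")
```

Since the standard error falls as the square root of the number of determinations grows, further determinations narrow the uncertainty, which is why the chemist can improve his odds by repeating the measurement.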

It is usually impossible to make an entirely valid generalization—to 
arrive at a certain conclusion on the basis of experimental evidence. 
But it is possible to measure the uncertainty of such conclusions in 
probability terms and thus resolve to a considerable degree a very 
troublesome problem faced by every scientist. 

The scope of statistical inference is as broad as experimentation 
itself. An experiment may be intended merely to evaluate a constant, 
as in the illustration just given, or it may be meant to evaluate param- 
eters in a function, or perhaps to estimate a function itself, or a set of 
functions. An experiment may be designed to test a certain hypothe- 
sis suggested by a tentative theory—the hypothesis that two factors 
are unrelated, that a relation has a specified functional form. The 
experimenter may have to contend with relatively small effects from 
extraneous factors, as in the physical sciences, or with quite large 
ones, as in the social sciences. In any case the problem of statistical 
inference arises. If an experiment indicates that a certain hypothesis 
is false, the hypothesis may nevertheless remain tenable in the experi- 
menter’s mind if that conclusion is not supported by heavy odds. 
The certainty of a conclusion is often as important as the conclusion 
itself in the final evaluation of an experiment. 

1.4. The Theory and Practice of Statistics. Another division of the 
field of statistics worth brief consideration is that between the theory 
and the methodology. 

The theory of statistics is a branch of applied mathematics. It has 
its roots in an area of pure mathematics known as the theory of prob- 
ability, and in fact the complete structure of statistical theory in a 
broad sense may be thought of as including the theory of probability. 
And it includes other things not part of the formal theory of prob- 
ability—theoretical consequences of the principle of randomization, 
various principles of estimation, and principles of testing hypotheses. 
These principles may be regarded as axioms which augment the axioms 
of probability theory. 

The statistician is, of course, engaged in producing tools for research 
workers. Faced with a particular experimental problem, he constructs 
a mathematical model to fit the experimental situation as best he can, 
analyzes the model by mathematical methods, and finally devises 
procedures for dealing with the problem. He is guided in this work 
by the principles of the theory of statistics. 

The statistician is also engaged in developing and extending the
theory of statistics. There are many quite important problems of 
experimental design and statistical inference which remain untouched 
because the theory of statistics is not yet powerful enough to deal with 
them. The broad advance in the application of statistical methods 
during the past two decades was made possible by far-reaching develop- 
ments in the theory which immediately preceded it. 

It may be interesting to remark here on the origins of the theory of 
statistics. Certain areas of biological experimentation reached a point 
where what are now called statistical methods were imperative if 
further progress was to be made. The essentials of statistical theory 
were then evolved by the biologists themselves. This parallels the 
natural history of almost any branch of abstract knowledge, but it is 
nevertheless curious in the case of statistics. For the theory of sta- 
tistics appears to be a very natural development of the theory of prob- 
ability, which is several hundred years old; somehow it was almost 
completely overlooked by workers in that field. Incidentally the 
situation which created statistical theory still obtains; there are many 
areas of scientific experimentation ready and waiting for statistical 
methods which do not yet exist. 

In contradistinction to the theory of statistics is the practice of 
statistics. There is a great body of tools and techniques for research 
workers which expands appreciably with the passing of each year. 
Until recent years the statistician was not much concerned with these 
tools, being content to pass them on to those who wished to use them. 
But as scientific research progresses experiments become more complex 
and the statistical tools become correspondingly complex and special- 
ized. In some areas the time has come when it is impossible for the 
research worker to become familiar with all the tools that might be 
useful to him. Furthermore, as tools become more specialized, they 
become less flexible; to fit a particular experiment the tool often has to 
be modified, and this requires knowledge of statistical theory. 

The use of statistical tools is not merely a matter of picking out the 
wrench that fits the bolt; it is more a matter of selecting the correct 
one of several wrenches which appear to fit the bolt about equally well 
but none of which fit it exactly. It is a long step from an algebraic 
formula to, for example, a nutrition experiment on hogs. There is 
nothing magic about the formula; it is merely a tool, and moreover a 
tool derived from some simple mathematical model which cannot 
possibly represent the actual situation with any great precision. In 
using the tool one must make a whole series of judgments relative to the
nature and magnitude of the various errors engendered by the discrepancies
between the model and the actual experiment. These
judgments cannot well be made by either the statistician or the experi- 
menter, for they depend both on the nature of statistical theory and 
the nature of the experimental material. 

To meet this development, the applied statistician has come on the 
scene. He is to be found in various industrial and academic research 
centers, and his function is, of course, to collaborate with the research 
workers in their experimentation and investigation. He must be 
completely familiar with both the theory and methodology of sta- 
tistics even though his work is concerned not with the field of statistics 
at all but with the field of application. We merely wish to observe 
here that applied statistics has developed to the point where it may be 
regarded as a field of interest in itself. 

1.5. The Scope of This Book. This book is concerned with the 
theory rather than the applications of statistics. In the course of the 
development many tools will be derived and discussed; a secondary 
purpose of the book is to make clear the conditions under which certain 
of the important statistical tools may be employed. But our primary 
purpose is the exposition of statistical theory. 

The book is introductory in that no knowledge of statistics by the 
reader is presumed. And it is elementary in that no knowledge of 
mathematics beyond elementary calculus is presumed. This restric- 
tion of the mathematical level is necessarily costly. We shall have to 
omit entirely many interesting but more technical developments of 
the theory; the generality of theorems will be reduced; it will be neces- 
sary to make statements without proof from time to time; mathemat- 
ical rigor will be sacrificed at many points; and cumbersome arguments 
will sometimes have to be used when very simple arguments at a higher 
mathematical level exist. All these sacrifices, however, will inhibit 
our presentation rather less than one might suppose. The essential 
aspects of the theory are entirely comprehensible without higher 
mathematics.

Since statistical theory is founded on probability theory, we shall 
begin the study with a consideration of probability concepts and the 
development of certain probability theorems which will be required. 
Next we shall consider mathematical models which have been found by 
experience to approximate many common experimental situations. 
It will then be possible to study mathematically the problems of 
statistical inference and of the design and analysis of experiments and 
investigations. 

1.6. Reference System. The chapters are divided into numbered 
sections; the numbering begins anew in each chapter. In referring 
to a section contained in the same chapter as the reference, only the 
section number is given. In referring to a section in a different chap- 
ter, the chapter number is prefixed to the section number and separated 
from it by a period. Thus Sec. 5.3 refers to Sec. 3 of Chap. 5. 

The equations are numbered anew in each section, and equation 
numbers are always enclosed in parentheses. Merely the equation 
number is given when referring to an equation in the same section as 
the reference; otherwise the section number is prefixed. Thus equa- 
tion (4.6) refers to the sixth equation of the fourth section of the same 
chapter as the reference, and equation (9.1.12) refers to the twelfth 
equation of the first section of the ninth chapter. 


CHAPTER 2 
PROBABILITY AND COMBINATORIAL METHODS 


2.1. Definition of Probability. Probability is a measure of the likeli- 
hood of occurrence of a chance event. A precise definition can be 
given in many ways, but for our immediate purposes, the following 
statement, known as the classical definition of probability, will suffice: 

If an event can occur in N mutually exclusive and equally likely ways, 
and if n of these outcomes have an attribute A, then the probability of A is 
the fraction n/N. 

We shall apply this definition to a few simple examples in order to 
illustrate its meaning. 

If an ordinary die (one of a pair of dice) is tossed, there are six pos- 
sible outcomes: any one of the six numbered faces may turn up. These 
six outcomes are mutually exclusive since two or more faces cannot 
turn up simultaneously. And, supposing the die to be fair or true, 
the six outcomes are equally likely; no one face is any more to be 
expected than another. Now suppose we want the probability that
the result of a toss be an even number. Three of the six possible
outcomes have that attribute. The probability that an even number
will appear when a die is tossed is therefore 3/6 or 1/2. Similarly, the
probability that a five will appear when a die is tossed is 1/6. The
probability that the result of a toss will be greater than two is 2/3.

To consider another example, suppose a card is drawn at random 
from an ordinary deck of playing cards. The probability of drawing
a spade is readily seen to be 13/52 or 1/4. The probability of drawing
a number between five and ten inclusive is 24/52 or 6/13.
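
The die and card examples can be checked by direct enumeration. The following is a minimal sketch, assuming Python 3; the function name `classical_probability` and the encoding of cards as (rank, suit) pairs are illustrative choices, not part of the text.

```python
from fractions import Fraction

def classical_probability(outcomes, has_attribute):
    """Return n/N for a finite list of equally likely outcomes."""
    outcomes = list(outcomes)
    favorable = sum(1 for outcome in outcomes if has_attribute(outcome))
    return Fraction(favorable, len(outcomes))

# A fair die: six mutually exclusive, equally likely faces.
die = range(1, 7)
p_even = classical_probability(die, lambda face: face % 2 == 0)
p_five = classical_probability(die, lambda face: face == 5)
p_over_two = classical_probability(die, lambda face: face > 2)

# An ordinary deck: ranks 1 (ace) to 13 (king) in each of four suits.
deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]
p_spade = classical_probability(deck, lambda card: card[1] == "S")
p_five_to_ten = classical_probability(deck, lambda card: 5 <= card[0] <= 10)

print(p_even, p_five, p_over_two, p_spade, p_five_to_ten)
```

Exact fractions are used so that the results can be compared directly with the ratios n/N in the definition.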

The application of the definition is straightforward enough in these 
simple cases, but it is not always so obvious. Careful attention must 
be paid to the qualifications “mutually exclusive” and “equally 
likely.” Suppose one wished to compute the probability of getting 
two heads if a coin were tossed twice. He might reason that there 
were three possible outcomes for the two tosses: two heads, two tails, 
or one head and one tail. One of these outcomes has the desired 
attribute; therefore the probability is 1/3. This reasoning is faulty
because the three given outcomes are not equally likely. The third
outcome can occur in two ways since the head may appear on the first
toss and the tail on the second, or the head may appear on the second
toss and the tail on the first. Thus there are four equally likely out-
comes: HH, HT, TH, TT. The first of these has the desired attribute
while the others do not. The correct probability is therefore 1/4.
The result would be the same if two coins were tossed simultaneously. 
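
The distinction between the three unequal outcomes and the four equally likely ones may be seen by listing them. A short sketch, assuming Python 3 (the variable names are illustrative):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The four equally likely ordered outcomes: HH, HT, TH, TT.
tosses = ["".join(pair) for pair in product("HT", repeat=2)]
p_two_heads = Fraction(tosses.count("HH"), len(tosses))

# Ignoring order merges HT and TH, so the three results are not equally likely.
unordered = Counter("".join(sorted(toss)) for toss in tosses)

print(tosses, p_two_heads, dict(unordered))
```

The counter shows that "one head and one tail" arises in two ways while each of the other two results arises in one.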

Again suppose one wished to compute the probability that a card 
drawn from an ordinary deck will be an ace or a spade. In enumerat- 
ing the favorable outcomes he might count 4 aces and 13 spades, and 
reason that there are 17 possible outcomes with the desired attribute. 
This is clearly incorrect because the events are not mutually exclusive. 
The occurrence of an ace does not preclude the occurrence of a spade. 
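
The double counting can be made visible by treating the two events as sets of cards. This is only a sketch under the assumption that cards are encoded as (rank, suit) pairs with rank 1 for the ace; the corrected count it produces comes from the enumeration, not from the text.

```python
from fractions import Fraction

deck = {(rank, suit) for suit in "SHDC" for rank in range(1, 14)}
aces = {card for card in deck if card[0] == 1}
spades = {card for card in deck if card[1] == "S"}

naive_count = len(aces) + len(spades)   # counts the ace of spades twice
ace_or_spade = aces | spades            # set union counts each card once
p_ace_or_spade = Fraction(len(ace_or_spade), len(deck))

print(naive_count, len(ace_or_spade), p_ace_or_spade)
```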

We note that a probability is always a number between zero and one. 
The ratio n/N must be a proper fraction since the total number of 
possible outcomes cannot be smaller than the number of outcomes with 
a specified attribute. If an event is certain to happen, its probability 
is one; while if it is certain not to happen, its probability is zero. 
Thus, the probability of obtaining an eight in tossing a die is zero. 
The probability that the outcome of tossing a die will be less than ten 
is one. 

The probabilities determined by the classical definition are called 
a priori probabilities. When one states that the probability of obtain- 
ing a head in tossing a coin is one-half, he has arrived at this result 
purely by deductive reasoning. The result does not require that any 
coin be tossed, or even be at hand. We say that if the coin is true, the
probability of a head is one-half, but this is little more than saying
the same thing in two different ways. Nothing is said about how one
can determine whether or not a particular coin is true.

The fact that we shall deal with ideal objects in developing the 
theory of probability will not trouble us, because that is a common 
requirement of mathematical systems. Geometry, for example, deals 
with conceptual perfect circles, lines with zero width, and so forth, but 
it is a useful branch of knowledge which can be applied to diverse 
practical problems. 

There are some rather troublesome defects in the classical, or a priori, 
approach. It is obvious, for example, that the definition of probability 
must be modified somehow when the total number of possible outcomes 
is infinite. One might seek, for example, the probability that a posi- 
tive integer drawn at random be even. The intuitive answer to this 
question is 1/2. If one were pressed to justify this result on the basis
of the definition, he might reason as follows: Suppose we limit
ourselves to the first 20 integers; 10 of these are even so that the ratio
of favorable events to the total number is 10/20 or 1/2. Again, if the
first 200 integers are considered, 100 of these are even, and the ratio is
also 1/2. In general, the first 2N integers contain N even integers; if we
form the ratio N/2N and let N become infinite so as to encompass the
whole set of positive integers, the ratio remains 1/2.

The above argument is plausible and the answer is plausible, but it is 
no simple matter to make the argument stand up. It depends, for 
example, on the natural ordering of the positive integers, and a differ- 
ent ordering could produce a different result. Thus, one could just 
as well order the integers in this way: 1, 3, 2; 5, 7, 4; 9, 11, 6; . . . ,
taking the first pair of odd integers, then the first even integer; the
second pair of odd integers, then the second even integer; and so forth.
With this ordering, one could argue that the probability of drawing an
even integer is 1/3. The integers can also be ordered so that the ratio
n/N will oscillate back and forth and never approach any definite 
value as N increases. 
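
The dependence of the limiting ratio on the ordering can be illustrated numerically. The sketch below assumes Python 3; the generator `odd_odd_even` is a hypothetical helper that produces the ordering just described (two odd integers, then one even integer), and the cutoff of 3000 terms is arbitrary.

```python
from fractions import Fraction
from itertools import count, islice

def odd_odd_even():
    """Yield 1, 3, 2, 5, 7, 4, 9, 11, 6, ... (two odds, then one even)."""
    odd, even = 1, 2
    while True:
        yield odd
        yield odd + 2
        yield even
        odd += 4
        even += 2

def even_ratio(ordering, how_many):
    """Fraction of even integers among the first `how_many` terms."""
    head = list(islice(ordering, how_many))
    return Fraction(sum(1 for k in head if k % 2 == 0), how_many)

natural_ratio = even_ratio(count(1), 3000)
reordered_ratio = even_ratio(odd_odd_even(), 3000)

print(natural_ratio, reordered_ratio)
```

Both orderings run through the same set of positive integers, yet the ratio of even integers among the first terms differs.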

There is another difficulty with the classical approach to the theory 
of probability which is deeper even than that arising in the case of an 
infinite number of outcomes. Suppose we have a coin known to be 
biased in favor of heads (it is loaded so that a head is more likely to 
appear than a tail). The two possible outcomes of tossing the coin 
are not equally likely. What is the probability of a head? The class- 
ical definition leaves us completely helpless here. 

In a situation like the above we shall simply assume that there does 
exist some definite though unknown number which gives the desired 
probability. And we shall assume that the number obeys the same 
laws as the probabilities arising from the classical definition. 

We have pointed out these difficulties merely to indicate the limita- 
tions of our approach. A complete discussion of these points belongs 
properly in a textbook on the theory of probability. There are other 
methods of defining probabilities which are logically more satisfactory 
than the one we have chosen, but ours has the advantage of simplicity. 
And as yet there is no general agreement among writers on the theory 
of probability as to what is the most satisfactory set of axioms for the 
theory. 

2.2. Permutations and Combinations. The evaluation of a priori 
probabilities requires the enumeration of all possible outcomes of a 
given chance event. This sort of enumeration can often be facilitated 
by certain combinatorial formulas which will be developed now. They 
are based on the following two basic principles: 

(a) If an event A can occur in a total of m ways and if a different 
event B can occur in n ways, then the event A or B can occur in m + n 
ways provided A and B cannot occur simultaneously. 

(b) If an event A can occur in a total of m ways and if a different event 
B can occur in n ways, then the event A and B can occur in mn ways. 

These two ideas may be illustrated by letting A correspond to the 
drawing of a spade from a deck of cards and B correspond to the draw- 
ing of a heart. Each of these events can be done in 13 ways. The 
number of ways in which a heart or a spade can be drawn is obviously 
13 + 13 = 26. To illustrate the second principle, suppose two cards 
are drawn from the deck in such a way that one is a spade and the other 
is a heart. There are 13 × 13 = 169 ways of doing this, since with
the ace of spades we may put any one of the 13 hearts, or with the king 
of spades we may put any one of the 13 hearts, and so on for all 13 
of the spades. 
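
The two principles correspond to concatenating lists and to forming all pairs. A minimal sketch, assuming Python 3 and an illustrative labeling of the cards as strings such as "AS" or "10H":

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
spades = [rank + "S" for rank in ranks]
hearts = [rank + "H" for rank in ranks]

# Principle (a): a heart OR a spade; the two events cannot occur together.
heart_or_spade = spades + hearts

# Principle (b): a spade AND a heart; pair each spade with each heart.
spade_and_heart = list(product(spades, hearts))

print(len(heart_or_spade), len(spade_and_heart))
```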

The two principles may clearly be generalized to take account of 
more than two events. Thus, if three mutually exclusive events A,
B, and C can occur in m, n, and p ways, respectively, then the event A 
or B or C can occur in m + n + p ways, and the event A and B and C 
can occur in mnp ways. 

We shall now use the second of these principles to enumerate the 
number of arrangements of a set of objects. Let us consider the num- 
ber of arrangements of the letters a, b, c. We can pick any one of the 
three to place in the first position; either of the remaining two may be 
put in the second position, and the third position must be filled by the 
unused letter. The filling of the first position is an event which can 
occur in three ways; the filling of the second position is an event which
can occur in two ways, and the third event can occur in one way. The
three events can occur together in 3 × 2 × 1 = 6 ways. The six
arrangements, or permutations, as they are called, are


abc, acb, bac, bca, cab, cba


In this simple example the elaborate method of counting was hardly 
worth while because it is easy enough to write down all the six permu-
tations. But if we had asked for the number of permutations of six
letters, we should have had

6 × 5 × 4 × 3 × 2 × 1 = 720

permutations to write down.

It is obvious now that in general the number of permutations of n
different objects is

$$n(n-1)(n-2)(n-3)\cdots(2)(1)$$


The row of dots indicates omission of intermediate factors. This 
product of an integer by all the positive integers smaller than it, is 
usually denoted more briefly by n! (read n factorial). Thus 2! = 2, 
3! = 6, 4! = 24, 5! = 120, etc. Since

$$(n-1)! = \frac{n!}{n}$$

it is common to define 0! as one, so that the relation will be consistent
when n = 1.
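
The counts of arrangements may be confirmed with the standard library. A short sketch, assuming Python 3; `itertools.permutations` lists the arrangements and `math.factorial` gives n!.

```python
from itertools import permutations
from math import factorial

# All arrangements of the letters a, b, c.
arrangements = ["".join(p) for p in permutations("abc")]

# Counting, rather than writing down, the arrangements of six letters.
six_letters = sum(1 for _ in permutations("abcdef"))

print(arrangements, six_letters, factorial(0))
```

Python's `factorial(0)` returns 1, in agreement with the convention that 0! is one.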

Let us now enumerate the number of permutations that may be 
made from n objects if only r of the objects are used in any given 
permutation. Reasoning as before, the first position may be filled in 
n ways, the second position may be filled in n — 1 ways, and so forth. 
When we come to the rth position, we will have used r — 1 of the 
objects so that n — (r — 1) will remain from which we can choose. 
The number of permutations of n objects taken r at a time is therefore 


$n(n-1)(n-2)\cdots(n-r+1)$. The symbol $P_{n,r}$ is used to
denote this number.

$$P_{n,r} = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!} \qquad (1)$$


Thus the number of permutations of the four letters a, b, c, d taken two
at a time is $P_{4,2}$ = 4 × 3 = 12. On putting r = n in equation (1),
we get the result stated earlier: that the number of permutations of n
objects taken n at a time is n!.
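
Equation (1) can be checked against an explicit listing. The sketch assumes Python 3.8 or later (for `math.perm`); the helper name `p_n_r` is illustrative.

```python
from itertools import permutations
from math import factorial, perm

def p_n_r(n, r):
    """Number of permutations of n objects taken r at a time: n!/(n-r)!."""
    return factorial(n) // factorial(n - r)

# The two-letter permutations of a, b, c, d, written out in full.
two_letter = ["".join(p) for p in permutations("abcd", 2)]

print(len(two_letter), p_n_r(4, 2), perm(4, 2), p_n_r(5, 5))
```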

With the aid of equation (1) we can now solve the following problem: 
In how many different ways can r objects be selected from n objects? 
$P_{n,r}$ counts all the possible selections as well as all the arrangements of
each selection or combination. Two combinations are different if they
are not made up of the same set of objects. Thus abc and abd are
different three-letter combinations, while abc and bac are different
permutations of the same combination. Let the symbol $\binom{n}{r}$ denote
the number of different combinations. Then it is clear that $P_{n,r}$ equals
$\binom{n}{r}$ times r!, since each combination of r objects has r! arrangements.
Therefore

$$\binom{n}{r} = \frac{P_{n,r}}{r!} = \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!} = \frac{n!}{r!\,(n-r)!} \qquad (2)$$


Another common symbol for this number is $C_{n,r}$, but we shall not use
it in this text. The number of combinations of five objects taken three
at a time is

$$\binom{5}{3} = \frac{5 \times 4 \times 3}{3 \times 2 \times 1} = 10$$

The number $\binom{n}{r}$ may be given a different interpretation. It is the
number of ways in which n objects may be divided into two groups, one
group containing r objects, and the other group containing the other
n − r objects. Now suppose we wish to divide n objects into three
groups containing $n_1$, $n_2$, $n_3$ objects, respectively, with

$$n_1 + n_2 + n_3 = n$$
We shall first divide them into two groups containing $n_1$ and $n_2 + n_3$
objects. This may be done in $\binom{n}{n_1}$ ways. Then we may divide the
second group into two groups containing $n_2$ and $n_3$ objects. This may
be done in $\binom{n_2+n_3}{n_2}$ ways. Using the second principle of enumera-
tion, the total number of ways of doing the two divisions together is

$$\binom{n}{n_1}\binom{n_2+n_3}{n_2} = \frac{n!}{n_1!\,(n_2+n_3)!} \cdot \frac{(n_2+n_3)!}{n_2!\,n_3!} = \frac{n!}{n_1!\,n_2!\,n_3!}$$


This type of argument may be carried further to find the number of
ways of dividing n objects into k groups containing $n_1, n_2, \ldots, n_k$
objects with $n_1 + n_2 + \cdots + n_k = n$. This number is readily
found to be

$$\frac{n!}{n_1!\,n_2! \cdots n_k!} \qquad (3)$$

Thus the number of ways of dividing four objects into three groups
containing 1, 1, and 2 objects is

$$\frac{4!}{1!\,1!\,2!} = 12$$

The expression (3) also has a second interpretation. It is the number
of different permutations of n objects when $n_1$ of the objects are
alike and of one kind, $n_2$ are alike and of a second kind, and so forth.
Referring to the numerical example above, there are 12 permutations
of the letters a, b, c, c. In order to see that expression (3) gives the
correct number, consider n different objects (for example, the letters
a, b, c, . . . , p) arranged in a definite order. And consider a division
of this set of objects into k groups, the first group containing $n_1$ objects,
the second $n_2$, and so forth. Now in the original arrangement of
objects, replace all the objects selected for the first group by ones; all
those selected for the second group by twos, and so forth. The result
will be a permutation of $n_1$ ones, $n_2$ twos, . . . , $n_k$ k's. A little
reflection will convince one that every division of the letters into the k 
groups corresponds to a different permutation of the integers, and that 
this is the total set of permutations, because if there were another, 
there would be another division of the letters into k groups. 
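
Both interpretations of expression (3) can be verified for small cases. The following sketch assumes Python 3.8 or later (for `math.comb` and `math.prod`); the function name `multinomial` is an illustrative choice.

```python
from itertools import permutations
from math import comb, factorial, prod

def multinomial(*group_sizes):
    """n!/(n1! n2! ... nk!) with n = n1 + ... + nk, as in expression (3)."""
    n = sum(group_sizes)
    return factorial(n) // prod(factorial(size) for size in group_sizes)

# First interpretation: dividing four objects into groups of 1, 1, and 2.
ways_1_1_2 = multinomial(1, 1, 2)

# Second interpretation: distinct permutations of the letters a, b, c, c.
distinct = sorted({"".join(p) for p in permutations("abcc")})

# Two successive two-group divisions agree with the three-group formula.
two_step = comb(9, 2) * comb(7, 3)

print(ways_1_1_2, len(distinct), two_step, multinomial(2, 3, 4))
```

With two groups the function reduces to the combination count, so `multinomial(3, 2)` equals `comb(5, 3)`.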

We have derived three formulas in this section, not only because 
they are useful but because their derivation serves to illustrate the 
application of the two principles of enumeration given at the beginning 
of the section. It is the methods that are important. The formulas 
will aid in solving many problems, but they are useless in many others, 
and one must then fall back on the elementary principles. 

Illustrative example: If two cards are drawn from an ordinary deck, 
what is the probability that one will be a spade and the other a heart? 

Since nothing is said about the order in which the spade and the heart 
should occur, this is a problem in combinations. To compute the 
probability, we must find the total number of possible outcomes of 
two-card draws, and then find the number of these that have the 
specified attribute. The total number of two-card combinations that
can be made up from 52 cards is $\binom{52}{2}$ = 1326. And we have seen
before that there are 13 × 13 = 169 different combinations with the
required attribute. The probability is therefore 169/1326 = 13/102.

This problem could also be solved by regarding the different two-card
permutations as the set of possible outcomes. The denominator of the
ratio would then be $P_{52,2}$ = 2652. To get the numerator, we consider
that each of the 169 two-card combinations has two permutations and
get 2 × 169 = 338 as the number of permutations with the required
attribute. Or we may start at the beginning as follows: The number
of permutations in which the spade occurs first and the heart
second is 13 × 13 = 169 by principle (b). And the number with the
heart first and the spade second is the same. Either of these sets of
permutations satisfies the specification. By principle (a) the required
number is 169 + 169 = 338. Again we find the probability is 13/102.
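
The agreement between the two ways of counting can be confirmed by enumerating both sets of outcomes. A sketch assuming Python 3, with cards encoded as (rank, suit) pairs and a hypothetical predicate `spade_and_heart`:

```python
from fractions import Fraction
from itertools import combinations, permutations

deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]

def spade_and_heart(cards):
    """True when the two cards are one spade and one heart, in either order."""
    return sorted(suit for _, suit in cards) == ["H", "S"]

hands = list(combinations(deck, 2))   # order ignored
draws = list(permutations(deck, 2))   # order kept

p_from_hands = Fraction(sum(map(spade_and_heart, hands)), len(hands))
p_from_draws = Fraction(sum(map(spade_and_heart, draws)), len(draws))

print(len(hands), len(draws), p_from_hands, p_from_draws)
```

Either set of outcomes may serve as the denominator, provided the numerator is counted in the same way.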

Illustrative example: What is the probability that of four cards drawn 
from an ordinary deck, at least three will be spades? 


Here again we are interested in combinations. The total number
of possible four-card combinations is $\binom{52}{4}$ = 270,725. To get the
numerator: the specification, at least three spades, means either three
or four. The number of four-card hands containing exactly three
spades is $\binom{13}{3}$ × 39 = 11,154; the first factor is the number of three-card
combinations of the 13 spades, and the second is the number of ways
a card may be selected from the other three suits; the product is taken
in accordance with principle (b). The number of hands with all cards
spades is $\binom{13}{4}$ = 715. By principle (a), the number of hands with the
required attribute is 11,154 + 715 = 11,869. The required probabil-
ity is 11,869/270,725.

One might attempt to find the numerator by the following method:
The number of three-card combinations of spades is $\binom{13}{3}$ = 286.
The fourth card may be either a spade or not a spade, and after three
spades have been selected, the fourth card may be selected from the
whole set of 49 remaining cards. Thus the required number of hands
is 49 × 286 = 14,014. This argument is faulty because the hands
with four spades have been counted more than once. A specific 
three-card combination of spades is AKQ, and when the jack of spades 
is drawn from the remaining 49 cards, we have the combination AKQJ. 
But we also count this combination when the AQJ is considered and 
the king is drawn from the remaining 49 cards. It is now clear that 
the hands with four spades have been counted four times in the above 
figure. We can obtain the correct result by subtracting from it three 
times the number of hands with four spades. The result is

$$14{,}014 - 3\binom{13}{4} = 11{,}869$$

as before.
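
The correct count, the overcount, and its correction can all be reproduced, and the result checked against a brute-force enumeration of every four-card hand. A sketch assuming Python 3.8 or later; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

exactly_three = comb(13, 3) * 39      # three spades and one other card
all_four = comb(13, 4)                # four spades
favorable = exactly_three + all_four  # principle (a)
total = comb(52, 4)

# The faulty method counts each four-spade hand four times.
overcount = 49 * comb(13, 3)
corrected = overcount - 3 * all_four

# Brute-force check over all four-card hands.
deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]
brute = sum(1 for hand in combinations(deck, 4)
            if sum(suit == "S" for _, suit in hand) >= 3)

print(favorable, total, overcount, corrected, brute)
```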

Illustrative example: Seven balls are tossed into four numbered boxes 
so that each ball falls in a box and is equally likely to fall in any of the 
boxes. What is the probability that the first box will contain two 
balls? 

Since the first ball may fall in any one of four ways, the second may 
fall in any one of four ways, and so forth, the total number of possible 
outcomes is, by principle (b), $4^7$. To enumerate the number of out-
comes with the desired attribute, let us first divide the seven balls
into two groups, one containing two and the other five balls. This
may be done in $\binom{7}{2}$ ways. Now the group of two will be put into the
first box and the other five distributed among the other three boxes.
This may be done, by the same reasoning as above, in $3^5$ ways. The
number of favorable outcomes is therefore $\binom{7}{2} 3^5$, and the desired
probability is

$$\frac{\binom{7}{2}\, 3^5}{4^7} \cong .3115$$

(The symbol $\cong$ is used to denote approximate equality.)
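
Since there are only 4^7 = 16,384 outcomes, the answer may be checked by listing every assignment of balls to boxes. A sketch assuming Python 3.8 or later, with the boxes numbered 0 to 3 (so that box 0 plays the part of the first box):

```python
from fractions import Fraction
from itertools import product
from math import comb

# Each outcome records the box (0 to 3) into which each of the seven balls falls.
outcomes = list(product(range(4), repeat=7))
favorable = sum(1 for boxes in outcomes if boxes.count(0) == 2)
probability = Fraction(favorable, len(outcomes))

print(len(outcomes), favorable, round(float(probability), 4))
```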

2.3. Stirling’s Formula. In finding numerical values of probabili- 
ties, one is often confronted with the evaluation of long factorial expres- 
sions which are troublesome to compute by direct multiplication. If 
an adding machine is available, and there are not a great number of 
factors in the expression, it is often convenient to use logarithms. 
However, when the factors become numerous, this method also 
becomes tedious, and much labor may be saved by using Stirling’s 
formula, which gives an approximate value of n!. It is 


$$n! \cong \sqrt{2\pi}\, n^{n+\frac{1}{2}} e^{-n} \qquad (1)$$

where e is the Napierian base, 2.71828 . . . . A much more accurate
approximation may be obtained by replacing the factor $e^{-n}$ by
$e^{-n + 1/(12n)}$, but this refinement is rarely used. To indicate the accu-
racy of the formula, we may compute 10!, which is actually 3,628,800.
Formula (1) using five-place logarithms gives

$$10! \cong 3{,}599{,}000$$

The more refined formula gives

$$10! \cong 3{,}629{,}000$$

The error in (1) for n = 10 is a little less than 1 per cent, and the
percentage error decreases as n increases.
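
The figures quoted for 10! can be reproduced in floating-point arithmetic. The sketch assumes Python 3 and takes Stirling's formula in its usual form, sqrt(2 pi) n^(n+1/2) e^(-n), with the refinement obtained by adding 1/(12n) to the exponent of e; the function names are illustrative.

```python
import math

def stirling(n):
    """Stirling's approximation to n!: sqrt(2*pi) * n**(n + 1/2) * e**(-n)."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

def stirling_refined(n):
    """The same formula with e**(-n) replaced by e**(-n + 1/(12n))."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n + 1 / (12 * n))

def relative_error(n):
    """Relative error of the unrefined formula."""
    exact = math.factorial(n)
    return (exact - stirling(n)) / exact

print(round(stirling(10)), round(stirling_refined(10)), relative_error(10))
```

Rounded to four significant figures, the two approximations agree with the values given in the text.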

2.4. Sum and Product Notations. A sum of terms such as
$n_3 + n_4 + n_5 + n_6 + n_7$ is often designated by the symbol
$\sum_{i=3}^{7} n_i$. The Σ
is the capital Greek letter sigma, and in this connection it is often
called the summation sign. The letter i is called the summation index.
The term following the Σ is called the summand. The i = 3 below Σ
indicates that the first term of the sum is obtained by putting i = 3
in the summand. The 7 above the Σ indicates that the final term of
the sum is obtained by putting i = 7 in the summand. The other
terms of the sum are obtained by giving i the integral values between
the limits 3 and 7. Thus


$$\sum_{j=2}^{5} (-1)^j j x^{2j} = 2x^4 - 3x^6 + 4x^8 - 5x^{10}$$


An analogous notation is obtained by substituting the capital Greek
letter Π for Σ. In this case the terms resulting from substituting the
integers for the index are multiplied instead of added. Thus


n[-«eva]-6-26996-96926-2 


a=1 


Using this notation, expression (2.3) derived previously may be written

$$n! \Big/ \prod_{i=1}^{k} n_i!$$


2.5. The Binomial and Multinomial Theorems. The expansion 
of the binomial expression $(x + y)^n$ is given in elementary algebra
courses, and a proof of the correctness of the expansion is ordinarily by
induction. We shall here expand the binomial by a simple combina-
torial method which readily generalizes to the multinomial case. If
we write the binomial in the form $(x + y)(x + y)(x + y) \cdots (x + y)$,
which has n factors, the problem of finding the coefficient of one of the
terms, say $x^{n-a} y^a$, reduces to the problem of finding the number of ways
of dividing the n factors into two groups. The first term of the expan-
sion is $x^n$, which is obtained by selecting the x from each of the factors.
The next term is some coefficient times $x^{n-1} y$. This term arises by
selecting the x from n − 1 of the factors and the y from the remaining
one. The one from which y is taken may be chosen in any of n ways;
hence the coefficient of $x^{n-1} y$ is n. In general, to get the coefficient
of $x^{n-a} y^a$, we must count the number of ways of dividing the n factors
into two groups so that one group contains a factors and the other
n − a factors; y is selected from each factor of the first group and x
from each factor of the second group. The number of ways of dividing
the n factors into two such groups is $\binom{n}{a}$, which is the desired coeffi-
cient. The binomial expansion is therefore

$$(x + y)^n = x^n + n x^{n-1} y + \binom{n}{2} x^{n-2} y^2 + \cdots + y^n = \sum_{a=0}^{n} \binom{n}{a} x^{n-a} y^a \qquad (1)$$
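
The combinatorial argument can be imitated directly: choose one letter from each of the n factors and count how many choices contain y exactly a times. A sketch assuming Python 3.8 or later; the function name `binomial_coefficients` and the trial values x = 2, y = 3 are illustrative.

```python
from collections import Counter
from itertools import product
from math import comb

def binomial_coefficients(n):
    """Count, for each a, the ways of taking y from exactly a of n factors."""
    picks = product("xy", repeat=n)   # one letter chosen from each factor
    return Counter(choice.count("y") for choice in picks)

coefficients = binomial_coefficients(5)

# Numerical check of the expansion of (x + y)^5 at x = 2, y = 3.
x, y, n = 2, 3, 5
expanded = sum(comb(n, a) * x ** (n - a) * y ** a for a in range(n + 1))

print(dict(coefficients), expanded, (x + y) ** n)
```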


The multinomial theorem follows directly. If the expression

$$(x_1 + x_2 + \cdots + x_k)^n$$

is multiplied out, one will obtain terms of the form

$$C x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}$$

where C is some coefficient and the exponents satisfy the relation

$$\sum_{i=1}^{k} n_i = n$$


We wish to determine C. Terms of the given form arise when $x_1$ is
selected from $n_1$ of the n factors, $x_2$ is selected from $n_2$ of the remaining
factors, and so forth. The number of ways of getting such a term is
equal to the number of ways of dividing the n factors into k groups
containing $n_1, n_2, \ldots, n_k$ factors. This is expression (3) of Sec. 2.
Thus the general term of the multinomial expansion is

$$\frac{n!}{n_1!\,n_2! \cdots n_k!}\, x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k} = n! \prod_{i=1}^{k} \frac{x_i^{n_i}}{n_i!}$$

and we may write

$$(x_1 + x_2 + \cdots + x_k)^n = \sum_{n_1, n_2, \ldots, n_k} n! \prod_{i=1}^{k} \frac{x_i^{n_i}}{n_i!} \qquad (2)$$

We have indicated only that the summation is over the indices $n_1, n_2,
\ldots, n_k$. The range of each index is zero to n, but they cannot all
be summed independently over that range because we must have
$\sum_{i=1}^{k} n_i = n$. The summation is over all sets of values of
$n_1, n_2, \ldots, n_k$ such that their sum is n and such that each $n_i$ is an
integer in the range zero to n inclusive. The sum is very troublesome to
write down when n is large. We shall illustrate it for a simple case.

$$(x_1 + x_2 + x_3)^4 = \sum_{n_1, n_2, n_3} \frac{4!}{n_1!\,n_2!\,n_3!}\, x_1^{n_1} x_2^{n_2} x_3^{n_3}$$

The sets of values of $(n_1, n_2, n_3)$ which satisfy $n_1 + n_2 + n_3 = 4$ are
(4, 0, 0), (3, 1, 0), (3, 0, 1), (2, 2, 0), (2, 1, 1), (2, 0, 2), (1, 3, 0), (1, 2, 1),
(1, 1, 2), (1, 0, 3), (0, 4, 0), (0, 3, 1), (0, 2, 2), (0, 1, 3), (0, 0, 4). The
sum therefore has 15 terms, the first few of which are

$$(x_1 + x_2 + x_3)^4 = \frac{4!}{4!\,0!\,0!} x_1^4 + \frac{4!}{3!\,1!\,0!} x_1^3 x_2 + \frac{4!}{3!\,0!\,1!} x_1^3 x_3 + \frac{4!}{2!\,2!\,0!} x_1^2 x_2^2 + \cdots + \frac{4!}{0!\,0!\,4!} x_3^4$$

$$= x_1^4 + 4x_1^3 x_2 + 4x_1^3 x_3 + 6x_1^2 x_2^2 + \cdots + x_3^4$$


A set of numbers such as (3, 1, 0) is called a three-part partition of 
four, (2, 6) is a two-part partition of eight. The 15 triplets of num- 
bers listed above form the complete set of ordered three-part partitions 
of four. The partitions are called ordered because the same combina- 
tion of three parts in a different order is counted as a different partition. 
If it is not specified that the partitions be ordered, the unordered ones 
are assumed; thus, the three-part partitions of four are simply (4, 0, 0), 
(3, 1, 0), (2, 2, 0), (2, 1, 1). In terms of the idea of partitions, the 
multinomial sum (2) may be described briefly as follows: the sum is
taken over all ordered k-part partitions of n, the parts being
$(n_1, n_2, \ldots, n_k)$.
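
The ordered and unordered partitions in the example are easily generated. The sketch assumes Python 3.8 or later; `ordered_partitions` is a hypothetical helper that admits zero as a part, as the text does.

```python
from itertools import product
from math import factorial, prod

def ordered_partitions(n, k):
    """All ordered k-part partitions of n, zero being allowed as a part."""
    return [parts for parts in product(range(n + 1), repeat=k)
            if sum(parts) == n]

ordered = ordered_partitions(4, 3)

# Dropping the order: keep one representative of each set of parts.
unordered = sorted({tuple(sorted(parts, reverse=True)) for parts in ordered},
                   reverse=True)

# Multinomial coefficient attached to each ordered partition.
coefficient = {parts: factorial(4) // prod(factorial(m) for m in parts)
               for parts in ordered}

print(len(ordered), unordered, sum(coefficient.values()))
```

Putting every x equal to one in the multinomial sum shows that the 15 coefficients must add to 3^4 = 81, which gives a convenient check.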

2.6. Combinatorial Generating Functions. The enumeration of 
possible outcomes and of outcomes with a certain attribute can become 
quite a complex problem. In fact, it is easy to state problems in which 
the enumeration is practically impossible. One of the most powerful 
devices for solving enumeration problems involves the use of what are 
called generating functions. The subject of combinatorial generating
functions is a field of mathematics in itself, and we shall consider only
a few simple cases here. We wish merely to indicate the nature of this
method of analysis.

Let us consider the last illustration given in Sec. 2 where seven balls 
were tossed into four boxes, and consider the function 


$$(x_1 + x_2 + x_3 + x_4)^7$$

The coefficient of a term such as $x_1^2 x_2^4 x_3$ in the expansion of this multi-
nomial is given by formula (2.3) as 7!/2!4!1!0!, which is just the num-
ber of ways of dividing seven objects into four groups so that the first
contains two objects, the second four, and so forth. So any term in
the multinomial expansion gives a description of a possible outcome; a
factor such as $x_i^5$ means five balls have fallen in the ith box, and the
numerical coefficient of the term gives the number of ways in which
that outcome can occur. If the x's are now all replaced by ones, the
terms become simply $7! \big/ \prod_{i=1}^{4} n_i!$, and to get the whole set of possible
outcomes, we need to sum this expression over all sets of the $n_i$ whose
sum is seven. This sum by the multinomial theorem is just

$$(1 + 1 + 1 + 1)^7 = 4^7$$


If we want the probability that the first box contains two balls, we
shall sum $7!/\prod n_i!$ over all sets of $n_i$ which have $n_1 = 2$. Let us rewrite
the term as

$$\frac{7!}{2!\,5!} \cdot \frac{5!}{n_2!\,n_3!\,n_4!}$$

and now we wish to sum this over all sets such that $n_2 + n_3 + n_4 = 5$.
If we multiply $5!/n_2!\,n_3!\,n_4!$ by $1^{n_2} 1^{n_3} 1^{n_4}$, we have the general term of
$(1 + 1 + 1)^5$; hence the desired sum is 7!/2!5! times $3^5$.

The function $(x_1 + x_2 + x_3 + x_4)^7$ is a simple type of generating
function; it is an algebraic expression which is given an interpretation
in terms of the physical problem at hand. It may be used to answer
any of the questions that may be asked about the physical problem to
which it is related. Thus, if the number of ways in which the first
two boxes can each contain at least two balls is required, we would add
the coefficients of all terms in the generating function which have $x_1$
and $x_2$ with powers greater than or equal to two.
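
Adding coefficients of selected terms is a mechanical task. The following sketch, assuming Python 3.8 or later, sums multinomial coefficients over the exponent sets of the seven-ball, four-box generating function and compares one such sum with a direct enumeration; the names `coefficient` and `exponent_sets` are illustrative.

```python
from itertools import product
from math import factorial, prod

def coefficient(exponents):
    """Multinomial coefficient of the term with the given exponents."""
    n = sum(exponents)
    return factorial(n) // prod(factorial(e) for e in exponents)

# Exponent sets (n1, n2, n3, n4) of the terms in the expansion.
exponent_sets = [e for e in product(range(8), repeat=4) if sum(e) == 7]

total = sum(coefficient(e) for e in exponent_sets)
first_box_two = sum(coefficient(e) for e in exponent_sets if e[0] == 2)
both_at_least_two = sum(coefficient(e) for e in exponent_sets
                        if e[0] >= 2 and e[1] >= 2)

# Direct enumeration of ball-to-box assignments, for comparison.
brute = sum(1 for boxes in product(range(4), repeat=7)
            if boxes.count(0) >= 2 and boxes.count(1) >= 2)

print(total, first_box_two, both_at_least_two, brute)
```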

Now let us consider another problem. An urn contains five black 
and four white balls. The balls are all drawn one by one from the 
urn, and the first three drawn are placed in a black box while the last 
six are placed in a white box. What is the probability that the num- 
ber of black balls in the black box plus the number of white balls in 
the white box is equal to five? 

We may solve this problem by considering the balls of each color to 
be numbered. The total number of ways of dividing the nine objects
into two groups, the first containing three and the second six, is $\binom{9}{3}$.
To get five balls to match the color of the box containing them, we
must clearly have two black balls in the black box and three white
ones in the white box. The black box may be filled in $\binom{5}{2}\binom{4}{1}$ ways
since there are $\binom{5}{2}$ ways of picking two black ones from the five black
ones to be among the first three drawn, and $\binom{4}{1}$ ways of choosing one
white ball to be among the first three drawn. The probability is

$$\binom{5}{2}\binom{4}{1} \Big/ \binom{9}{3}$$
The following generating function may be related to this problem:

$$(x_1 t + x_2)^5 (x_1 + x_2 t)^4$$

Here $x_1$ corresponds to the black box and $x_2$ to the white one. The
first factor corresponds to the five black balls, and the second to the
four white balls. We shall consider the coefficient of the term involv-
ing $x_1^3 x_2^6$. It will be a polynomial in t, and if t were put equal to one,
the polynomial would have the value $\binom{9}{3}$ since then we should have
the coefficient of $x_1^3 x_2^6$ in $(x_1 + x_2)^9$. The coefficient of $t^r$ in the poly-
nomial is the number of ways in which r balls can fall in boxes of the
same color as the balls. In forming a term in $x_1^3 x_2^6$, we may choose
certain of the $x_1$'s from the factor $(x_1 t + x_2)^5$ and the remainder from
the other factor. Those chosen from the first factor represent black
balls, and those chosen from the second represent white balls. Thus,
when a black ball is associated with the black box, we get a factor t
and when a white ball is associated with the white box, we also get a
factor t. The power of t then gives the total number of times a ball
is associated with a box of its color. On expanding the generating
function, one would find the coefficient of $x_1^3 x_2^6 t^5$ to be $\binom{5}{2}\binom{4}{1}$ as
before.
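
The expansion can be carried out mechanically. In the sketch below (Python 3 assumed) a polynomial in x1, x2, and t is stored as a dictionary mapping exponent triples (power of x1, power of x2, power of t) to coefficients; this representation and the helper names `multiply` and `power` are illustrative choices, not part of the text.

```python
from collections import defaultdict
from fractions import Fraction

def multiply(p, q):
    """Product of two polynomials stored as {(a, b, r): coefficient}."""
    out = defaultdict(int)
    for (a1, b1, r1), c1 in p.items():
        for (a2, b2, r2), c2 in q.items():
            out[(a1 + a2, b1 + b2, r1 + r2)] += c1 * c2
    return dict(out)

def power(p, n):
    """The polynomial p raised to the nth power by repeated multiplication."""
    result = {(0, 0, 0): 1}
    for _ in range(n):
        result = multiply(result, p)
    return result

black = {(1, 0, 1): 1, (0, 1, 0): 1}   # x1*t + x2, one factor per black ball
white = {(1, 0, 0): 1, (0, 1, 1): 1}   # x1 + x2*t, one factor per white ball
generating = multiply(power(black, 5), power(white, 4))

# Coefficient of x1^3 x2^6, kept as a polynomial in t: {power of t: coefficient}.
matches = {r: c for (a, b, r), c in generating.items() if (a, b) == (3, 6)}
probability = Fraction(matches[5], sum(matches.values()))

print(matches, probability)
```

Putting t equal to one amounts to adding the coefficients in `matches`, which recovers the total number of ways of filling the black box.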

The generating function is of no value for this simple problem, but 
it becomes useful if more than two colors are considered. Thus sup- 
pose an urn contained n; balls of a given color, ne of a second color, and 
ns of a third color; and suppose mı are drawn and placed in a box of 
the first color, ms are then drawn and placed in a box of the second 
color, and the remaining balls, say ms of them, are placed in a box of 
the third color. Let n be the total number of balls; then - 


n = ma + Ne + Nns = m + m: +m 
The coefficient of sre" in the function 
(ait + xa + ma) (vi + wat + x5)" (a1 + xe + ast)” 


gives the number of ways in which r balls match color of the box con- 
taining them. The coefficient is difficult to calculate in this case, but 
21 


Accesstoned No. eic 


§2.6 PROBABILITY AND COMBINATORIAL METHODS 


to find it is a straightforward procedure, while to find it without the 
generating function is considerably more troublesome. 
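
The three-color generating function can be checked on a small case. The
following is a minimal sketch in Python, not part of the original
development: the function names and the numbers $n_1 = n_2 = n_3 = 2$,
$m_1 = 1$, $m_2 = 2$, $m_3 = 3$ are our own hypothetical choices, and the
balls are treated as distinguishable, one factor per ball. It expands the
product term by term and compares the coefficients with a direct listing
of all assignments of balls to boxes.

```python
from collections import Counter
from itertools import product


def matches_by_generating_function(n, m):
    """Return {r: coefficient of x1^m1 x2^m2 x3^m3 t^r} for the product
    (x1 t + x2 + x3)^n1 (x1 + x2 t + x3)^n2 (x1 + x2 + x3 t)^n3."""
    # A polynomial is stored as {(e1, e2, e3, r): coefficient}.
    poly = Counter({(0, 0, 0, 0): 1})
    for color, count in enumerate(n):
        for _ in range(count):              # one factor per ball of this color
            expanded = Counter()
            for (e1, e2, e3, r), coeff in poly.items():
                for box in range(3):        # pick the x_box term of the factor
                    exps = [e1, e2, e3]
                    exps[box] += 1
                    matched = 1 if box == color else 0   # that term carries t
                    expanded[(exps[0], exps[1], exps[2], r + matched)] += coeff
            poly = expanded
    return {r: c for (e1, e2, e3, r), c in poly.items()
            if (e1, e2, e3) == tuple(m)}


def matches_by_enumeration(n, m):
    """Count the same quantity by listing every assignment of balls to boxes."""
    colors = [color for color, count in enumerate(n) for _ in range(count)]
    tally = Counter()
    for boxes in product(range(3), repeat=len(colors)):
        if all(boxes.count(b) == m[b] for b in range(3)):
            tally[sum(b == c for b, c in zip(boxes, colors))] += 1
    return dict(tally)


n_balls, box_sizes = (2, 2, 2), (1, 2, 3)   # hypothetical small case
by_gf = matches_by_generating_function(n_balls, box_sizes)
by_count = matches_by_enumeration(n_balls, box_sizes)
```

The two dictionaries agree, and their values add up to the total number
of ways of filling the boxes, $6!/(1!\,2!\,3!) = 60$.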

We shall consider one other kind of generating function. If five 
dice are tossed, what is the probability that the sum of the spots will 
be 15? 

Since the first die may fall in six ways, the second may fall in six 
ways, and so forth, the total number of possible outcomes is $6^5$. Now
we need the number of these outcomes that have a sum equal to 15. 
In the case of two dice, it is easy to write down all possible combina- 
tions which give a specified sum. Thus to obtain a sum of five, the 
two dice may fall (1, 4), (2, 3), (3, 2), (4, 1). These are the ordered 
two-part partitions of five when zero is excluded as a part. In our 
problem we must enumerate all the ordered five-part partitions of 15 
which have all parts between one and six inclusive. 

In problems involving partitions of numbers, there is a generating 
function which will usually materially simplify the enumeration. For 
the particular problem of counting the ways of getting 15 with five 
dice, let us consider this function: 

$$(x + x^2 + x^3 + x^4 + x^5 + x^6)^5 \qquad (1)$$

It is a polynomial in $x$ in which the term of lowest degree is $x^5$ and the
term of highest degree is $x^{30}$. Let us suppose that the function is
written as the product of five factors instead of as a fifth power. The
first factor will be associated with the first die, the second factor with
the second die, and so on. In the expansion of the function there will be
a number of terms $x^{15}$; one, for example, will arise when $x$ is selected
from each of the first three factors and $x^6$ is selected from the remaining
two factors. This situation corresponds to the appearance of a one
on the first three dice, and a six on the other two. It is readily seen
that there is a one-to-one correspondence between the ways $x^{15}$ can
arise in the expansion and the ways the five dice can total 15. Hence
our required number is the coefficient of $x^{15}$ in the expansion of the
function. This coefficient may be found most easily by use of the
following identity:

$$\frac{1 - x^n}{1 - x} = 1 + x + x^2 + \cdots + x^{n-1} \qquad (2)$$

which may be verified by multiplying both sides by $1 - x$. Using
this identity, the generating function may be put in the form:

$$\frac{x^5 (1 - x^6)^5}{(1 - x)^5}$$

We may omit the factor $x^5$ and find the coefficient of $x^{10}$ in what
remains. Now we need another identity:

$$\frac{1}{(1 - x)^n} = \sum_{i=0}^{\infty} \binom{n + i - 1}{i} x^i \qquad (3)$$

which reduces our problem to that of finding the coefficient of $x^{10}$ in

$$(1 - x^6)^5 \sum_{i=0}^{\infty} \binom{4 + i}{i} x^i$$

If the first factor is expanded, all but the first two terms have $x$ to a
higher power than 10 and may be neglected. And now the problem
becomes that of finding the coefficient of $x^{10}$ in

$$(1 - 5x^6) \sum_{i=0}^{\infty} \binom{4 + i}{i} x^i$$

which has two terms in $x^{10}$: one when the 1 is multiplied by the term
given by $i = 10$ in the sum, and the other when the $-5x^6$ is multiplied
by the term given by $i = 4$ in the sum. The coefficient is therefore
$\binom{14}{10} - 5\binom{8}{4}$, and the probability we set out to find is

$$\frac{\binom{14}{10} - 5\binom{8}{4}}{6^5} = \frac{651}{7776} = .0837$$
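
The count of 651 can be confirmed by carrying out the expansion
directly. The sketch below is ours, not the text's: it multiplies out
$(x + x^2 + \cdots + x^6)^5$ as a list of coefficients (the function name
`ways_by_expansion` is a hypothetical helper) and compares the result
with the value given by the two identities.

```python
from fractions import Fraction
from math import comb


def ways_by_expansion(dice=5, total=15, faces=6):
    """Coefficient of x^total in (x + x^2 + ... + x^faces)^dice."""
    coeffs = [1]                      # coeffs[k] is the coefficient of x^k
    for _ in range(dice):             # multiply by one factor per die
        nxt = [0] * (len(coeffs) + faces)
        for k, c in enumerate(coeffs):
            for face in range(1, faces + 1):
                nxt[k + face] += c
        coeffs = nxt
    return coeffs[total]


# Value obtained from identities (2) and (3) of the text.
ways_by_identity = comb(14, 10) - 5 * comb(8, 4)

probability = Fraction(ways_by_expansion(), 6 ** 5)
```

Both routes give 651 favorable outcomes out of $6^5 = 7776$.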


These examples will serve to indicate the kind of attack that may 
be made on enumeration problems by means of generating functions. 
The method is powerful, but we cannot develop it here. We merely
wish to point out the existence of the method. 

2.7. Marginal and Conditional Probability. Suppose that there are
$n$ equally likely possible outcomes of a chance event, and that they
may be classified according to two criteria. Thus the event may be
the selection of a ball from an urn in which all the balls are colored
and all are numbered; the possible outcomes may be classified according
to color, or according to number. In general, suppose there is an A
classification with $r$ classes which we denote by $A_1, A_2, \ldots, A_r$ and
a B classification with $s$ classes denoted by $B_1, B_2, \ldots, B_s$. The
outcomes may then be classified in a two-way table as follows:

         |  $B_1$      $B_2$      ...   $B_s$
  -------+--------------------------------------
  $A_1$  |  $n_{11}$   $n_{12}$   ...   $n_{1s}$
  $A_2$  |  $n_{21}$   $n_{22}$   ...   $n_{2s}$
  ...    |  ...        ...        ...   ...
  $A_r$  |  $n_{r1}$   $n_{r2}$   ...   $n_{rs}$

Here we have indicated that $n_{11}$ of the $n$ outcomes have both the
attribute $A_1$ and the attribute $B_1$; $n_{12}$ have both the attribute $A_1$ and
the attribute $B_2$; and in general $n_{ij}$ of the outcomes have the attributes
$A_i$ and $B_j$. The sum of all $n_{ij}$ is $n$. As an example we may consider
the drawing of a card from an ordinary deck of playing cards. The
52 outcomes may be classified according to suit (say $A_1, A_2, A_3, A_4$),
or according to denomination (say $B_1, B_2, \ldots, B_{13}$). In this example
every $n_{ij}$ is one.

The probability that the event will have a given specification, $A_1$
and $B_3$, for example, will be denoted by $P(A_1, B_3)$, and the value of this
probability is obviously $n_{13}/n$. In general,

$$P(A_i, B_j) = \frac{n_{ij}}{n}$$

We may be interested in only one of the criteria of classification, say A,
and indifferent to the B classification. In this case B is omitted from
the symbol, and the probability of $A_2$, say, is written $P(A_2)$, and

$$P(A_2) = \frac{1}{n} \sum_{j=1}^{s} n_{2j}$$

This is called a marginal probability, and the term marginal is used
whenever one or more criteria of classification are ignored. It is clear
that

$$P(A_i) = \sum_{j=1}^{s} \frac{n_{ij}}{n}$$

or

$$P(A_i) = \sum_{j=1}^{s} P(A_i, B_j) \qquad (1)$$

since $n_{ij}/n = P(A_i, B_j)$. Also the marginal probability of $B_j$ is

$$P(B_j) = \sum_{i=1}^{r} P(A_i, B_j) \qquad (2)$$


Thus the probability that a chance event has a specified attribute is 
the sum of all the probabilities of events that have that attribute. 
The probability that a card be an ace is the sum of the probabilities 
that it be the ace of spades, the ace of hearts, the ace of diamonds, and 
the ace of clubs. 
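
As a small illustration of equations (1) and (2), the marginal
probabilities of a two-way table are just its row and column totals
divided by $n$. The sketch below is our own, with the table stored as a
list of rows of counts; the names `marginals` and `deck` are
hypothetical. It uses the card example, in which every $n_{ij}$ is one.

```python
from fractions import Fraction


def marginals(table):
    """Return (row marginals, column marginals) of a table of counts."""
    n = sum(sum(row) for row in table)
    row_marginals = [Fraction(sum(row), n) for row in table]
    col_marginals = [Fraction(sum(col), n) for col in zip(*table)]
    return row_marginals, col_marginals


# Rows are the four suits, columns the thirteen denominations.
deck = [[1] * 13 for _ in range(4)]
p_suit, p_denomination = marginals(deck)
```

Each entry of `p_denomination` is the sum of four joint probabilities of
1/52, which is the statement made above about the probability of an ace.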

In a more general situation, suppose there are three criteria of
classification, A, B, and C. Let $n_{ijk}$ of the $n$ possible outcomes have the
specification $A_i$, $B_j$, $C_k$; and let the C classification be
$C_1, C_2, \ldots, C_t$, with the A and B classes the same as before. The
complete classification would be a three-way table consisting of $t$ layers
of two-way tables, each layer corresponding to a $C_k$. The marginal
probability of, say, $A_i$ and $C_k$ is

$$P(A_i, C_k) = \sum_{j=1}^{s} P(A_i, B_j, C_k) \qquad (3)$$

and the marginal probability of $C_k$ is

$$P(C_k) = \sum_{i=1}^{r} \sum_{j=1}^{s} P(A_i, B_j, C_k) \qquad (4)$$
$$= \sum_{i=1}^{r} P(A_i, C_k) \qquad (5)$$
$$= \sum_{j=1}^{s} P(B_j, C_k) \qquad (6)$$


The extension of these ideas to more than three criteria of classification 
is apparent. 

Returning to the original two-way classification, suppose the out- 
come of a chance event is examined for one attribute but not for the 
other. We wish to find the probability that the other attribute has a 
specified value. The event, for example, may be observed to have the 
attribute $B_3$. What is the probability that it also has the attribute $A_2$?
The total number of outcomes for A, given that $B_3$ has occurred, is
$\sum_{i=1}^{r} n_{i3}$, and the number of favorable outcomes for $A_2$ is
$n_{23}$. Thus the probability of $A_2$, given that $B_3$ has occurred, is
$n_{23} / \sum_{i=1}^{r} n_{i3}$. This is called a conditional probability
and is denoted by the symbol $P(A_2|B_3)$. In general

$$P(A_i|B_j) = \frac{n_{ij}}{\sum_{i'=1}^{r} n_{i'j}}$$

$$P(B_j|A_i) = \frac{n_{ij}}{\sum_{j'=1}^{s} n_{ij'}}$$

On dividing both the numerator and denominator of the fraction on the
right by $n$, we have

$$P(A_i|B_j) = \frac{P(A_i, B_j)}{P(B_j)} \qquad (7)$$
$$P(B_j|A_i) = \frac{P(A_i, B_j)}{P(A_i)} \qquad (8)$$

or in another form

$$P(A_i, B_j) = P(A_i|B_j) P(B_j) \qquad (9)$$
$$= P(B_j|A_i) P(A_i) \qquad (10)$$


The last equation may be stated: the probability that an outcome will
have the attributes $A_i$ and $B_j$ is equal to the marginal probability of
$A_i$ multiplied by the conditional probability of $B_j$ given that $A_i$ has
occurred.
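
These relations are easily checked numerically. The following sketch is
ours; the table of counts and the function names are hypothetical, and
rows and columns are indexed from zero rather than from one. It computes
the conditional probabilities directly from the counts and verifies that
the joint probability factors in both ways.

```python
from fractions import Fraction

# Hypothetical two-way table of counts n_ij: two A classes, three B classes.
counts = [[3, 1, 2],
          [1, 4, 1]]
n = sum(sum(row) for row in counts)


def joint(i, j):
    return Fraction(counts[i][j], n)


def marginal_a(i):
    return Fraction(sum(counts[i]), n)


def marginal_b(j):
    return Fraction(sum(row[j] for row in counts), n)


def a_given_b(i, j):
    # n_ij divided by the total of column j
    return Fraction(counts[i][j], sum(row[j] for row in counts))


def b_given_a(i, j):
    # n_ij divided by the total of row i
    return Fraction(counts[i][j], sum(counts[i]))


checks = [
    joint(i, j) == a_given_b(i, j) * marginal_b(j) == b_given_a(i, j) * marginal_a(i)
    for i in range(2) for j in range(3)
]
```

Every entry of `checks` is true, whatever table of positive counts is
supplied.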

The idea of conditional probability has a straightforward extension 
to situations involving more than two criteria of classification. In the 
case of three criteria, for example, it may be shown directly that 


$$P(A_i, B_j | C_k) = \frac{P(A_i, B_j, C_k)}{P(C_k)} \qquad (11)$$
$$P(A_i | B_j, C_k) = \frac{P(A_i, B_j, C_k)}{P(B_j, C_k)} \qquad (12)$$

also that

$$P(A_i, B_j, C_k) = P(A_i, B_j | C_k) P(C_k) \qquad (13)$$
$$= P(A_i | B_j, C_k) P(B_j, C_k) \qquad (14)$$
$$= P(A_i | B_j, C_k) P(B_j | C_k) P(C_k) \qquad (15)$$

and other similar relations could be obtained by permuting the letters
A, B, C. Thus

$$P(B_j | A_i, C_k) = \frac{P(A_i, B_j, C_k)}{P(A_i, C_k)} \qquad (16)$$

and

$$P(A_i, B_j, C_k) = P(B_j | A_i, C_k) P(A_i | C_k) P(C_k) \qquad (17)$$

or

$$P(A_i, B_j, C_k) = P(B_j | A_i, C_k) P(C_k | A_i) P(A_i) \qquad (18)$$


We shall not take the space to write out all such possible relations, but 
the student would do well to do so. These relations are fundamental 
in the theory of statistics and must be well understood. 

In defining conditional probability we have used a rather specialized 
model. But it is apparent that the idea is quite general. Let X be 
any subset of the whole set of possible outcomes, and let Y be any sub- 
set of X; then 

$$P(Y|X) = \frac{P(Y)}{P(X)}$$

for if $N$ is the total number of outcomes, $n$ is the number in X, and $m$
is the number in Y, then $P(Y|X) = m/n$, $P(Y) = m/N$, and
$P(X) = n/N$.


2.8. Two Basic Laws of Probability. The two laws correspond to 
the two principles of enumeration discussed in Sec. 2. The additive 
law of probability states that 

If A and B are mutually exclusive subsets of the whole set of possible 
outcomes of a chance event, then the probability that the event occurs in A 
or B is equal to the probability that it occurs in A plus the probability
that it occurs in B. 

Symbolically, we may write this as 


P(A or B) = P(A) + P(B) (1) 


This law follows directly from principle (a) of Sec. 2. In general, if
$A_1, A_2, \ldots, A_h$ are mutually exclusive subsets of the whole set of
outcomes, then

$$P(A_1 \text{ or } A_2 \text{ or } \cdots \text{ or } A_h) = \sum_{i=1}^{h} P(A_i) \qquad (2)$$

The marginal probability defined by (7.1) is a special case of this
relation. The specification $A_i$ is fulfilled by the subsets $A_i, B_1$;
$A_i, B_2$; $\ldots$; $A_i, B_s$; hence

$$P(A_i) = P(A_i, B_1 \text{ or } A_i, B_2 \text{ or } \cdots \text{ or } A_i, B_s) = \sum_{j=1}^{s} P(A_i, B_j)$$

If the two subsets A and B of (1) are not mutually exclusive, then
(1) is no longer true. In this case, certain outcomes have both the
attribute A and the attribute B. We may interpret this in terms of
the two-way classification given at the beginning of Sec. 7. Suppose
we want the probability that the outcome is in $A_1$ or $B_2$. $A_1$ consists
of the first row of the table and $B_2$ consists of the second column. The
outcomes in $A_1, B_2$ satisfy both specifications, and thus the two sets
$A_1$ and $B_2$ are not mutually exclusive. The probability that the
outcome falls in $A_1$ or $B_2$ is easily calculated by adding all $n_{ij}$ in the
first row and second column and dividing by $n$.

$$P(A_1 \text{ or } B_2) = \frac{\sum_{j=1}^{s} n_{1j} + \sum_{i=1}^{r} n_{i2} - n_{12}}{n}$$
$$= \frac{\sum_{j=1}^{s} n_{1j}}{n} + \frac{\sum_{i=1}^{r} n_{i2}}{n} - \frac{n_{12}}{n}$$
$$= P(A_1) + P(B_2) - P(A_1, B_2) \qquad (3)$$
This gives us a more general law of addition of probabilities. 

If A and B are subsets of the set of outcomes of a chance event, the
probability that the event occurs in A or B is equal to the probability that
it occurs in A plus the probability it occurs in B minus the probability
that it occurs in both A and B.
The situation is illustrated in Fig. 1, where the outcomes of a chance 
event are represented by points in a plane and two subsets are enclosed 
by two circles A and B. Certain outcomes fall in the lenticular region 
common to both circles, and in adding the outcomes in both circles, 
these points are counted twice and must therefore be subtracted once. 
Symbolically, the additive law is 


P(A or B) = P(A) + P(B) — P(A, B) (4) 
We may generalize this law to account for more than two subsets; 
thus 


$$P(A \text{ or } B \text{ or } C) = P(A) + P(B) + P(C) - P(A, B) - P(A, C) - P(B, C) + P(A, B, C) \qquad (5)$$

as is easily verified by drawing a figure similar to Fig. 1 in which three
circles intersect so as to have a region common to all three. The general
law for $h$ subsets, which may be proved by induction, is

$$P(A_1 \text{ or } A_2 \text{ or } \cdots \text{ or } A_h) = \sum_{i=1}^{h} P(A_i) - \sum_{i<j} P(A_i, A_j) + \sum_{i<j<k} P(A_i, A_j, A_k) - \cdots \pm P(A_1, A_2, \ldots, A_h) \qquad (6)$$

where the second sum is over all combinations of the numbers
$1, 2, \ldots, h$ taken two at a time, the third is over all combinations of the
numbers taken three at a time, and so forth. If all the subsets are
mutually exclusive, then all the probabilities in the sums beyond the
first sum are zero, and (6) reduces to (2).

Fig. 1.
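
The general additive law (6) lends itself to a direct check by counting.
In the sketch below, which is our own illustration, the outcomes are the
integers 1 to 30, assumed equally likely, and the three subsets are the
multiples of 2, 3, and 5; these choices and the function name are
hypothetical. The alternating sums of (6) are compared with the
probability obtained by counting the union directly.

```python
from fractions import Fraction
from itertools import combinations


def union_probability(subsets, n_outcomes):
    """Right-hand side of (6): alternating sums over all combinations."""
    total = Fraction(0)
    for size in range(1, len(subsets) + 1):
        sign = 1 if size % 2 == 1 else -1
        for group in combinations(subsets, size):
            common = set.intersection(*group)   # outcomes in every subset
            total += sign * Fraction(len(common), n_outcomes)
    return total


outcomes = range(1, 31)
A = {k for k in outcomes if k % 2 == 0}
B = {k for k in outcomes if k % 3 == 0}
C = {k for k in outcomes if k % 5 == 0}

by_formula = union_probability([A, B, C], 30)
by_counting = Fraction(len(A | B | C), 30)
```

Both computations give 22/30, since 22 of the 30 integers are divisible
by at least one of 2, 3, and 5.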

We have essentially derived the multiplicative law of probabilities 
in defining conditional probability in the preceding section. 

If some of the outcomes of a chance event can have both the attributes A 
and B, the probability of such an occurrence is equal to the probability 
of A multiplied by the conditional probability of B given that A has 
occurred, or it is equal to the probability of B multiplied by the conditional
probability of A given that B has occurred.

In symbols,

$$P(A, B) = P(A) P(B|A) \qquad (7)$$
$$= P(B) P(A|B) \qquad (8)$$


We may refer to the model given in the preceding section, or we may use
the model of Fig. 1. Let $n$ be the number of points in Fig. 1; let $m_1$
be the number of points in A (including those common to B), $m_2$ be
the number in B, and $m_3$ be the number common to A and B. Then

$$P(A, B) = \frac{m_3}{n}, \quad P(A) = \frac{m_1}{n}, \quad P(B) = \frac{m_2}{n}, \quad P(A|B) = \frac{m_3}{m_2}, \quad P(B|A) = \frac{m_3}{m_1}$$


whence (7) and (8) follow directly. 
In general we may show by induction that 


$$P(A_1, A_2, \ldots, A_h) = P(A_1) P(A_2|A_1) P(A_3|A_1, A_2) P(A_4|A_1, A_2, A_3) \cdots P(A_h|A_1, A_2, \ldots, A_{h-1}) \qquad (9)$$

and there are $h!$ such relations which may be obtained by permuting
the letters in the right-hand side of (9). The two relations for h = 2 
are given by (7) and (8). 

2.9. Compound Events. The multiplicative law of probabilities is 
particularly useful in simplifying the computation of probabilities for 
compound events. A compound event is one that consists of two or 
more single events as when a die is tossed twice, or three cards are 
drawn one at a time from a deck. The following simple example will 
illustrate the method: 

Two balls are drawn, one at a time, from an urn containing two 
black, three white, and four red balls. What is the probability that 
the first is red and the second is white? (The first is not replaced 
before the second is drawn.) 

The outcomes of this compound event may be classified according 
to two criteria: the color of the first ball, and the color of the second 
ball. We may therefore construct a table like that at the beginning 
of Sec. 7. The A classification corresponds to the color of the first 
ball, and we shall let $A_1$, $A_2$, $A_3$ correspond to the colors black, white,
and red, respectively. Similarly the classes $B_1$, $B_2$, $B_3$ will correspond
to the same colors for the second ball. The total number of outcomes
is $n = 9 \times 8 = 72$. It is not $\binom{9}{2} = 36$, because we are considering
permutations, not combinations; i.e., we are not asking that one ball
be red and one white; we require that the colors appear in a specific
order. The complete table of outcomes is

                 |  $B_1$ (black)   $B_2$ (white)   $B_3$ (red)
  ---------------+----------------------------------------------
  $A_1$ (black)  |       2               6              8
  $A_2$ (white)  |       6               6             12
  $A_3$ (red)    |       8              12             12

and the probability asked for in the problem is

$$P(A_3, B_2) = \frac{12}{72} = \frac{1}{6}$$


By using the multiplicative law of probabilities, we need only con- 
sider the two separate events one at a time. Here we must use the 
law in the form 

$$P(A_3, B_2) = P(A_3) P(B_2|A_3)$$

Now $P(A_3)$ is simply the probability of drawing a red ball in a single
draw, which is 4/9, and $P(B_2|A_3)$ is the probability of drawing a white
one, given that a red one has already been drawn, which is 3/8. The
product of these two numbers gives the required probability

$$P(A_3, B_2) = \frac{4}{9} \times \frac{3}{8} = \frac{1}{6}$$
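
The agreement between the two methods can be verified by listing the
ordered draws. The following sketch is ours, not the text's; the balls
are labeled 0 to 8 so that the 72 ordered pairs are equally likely, and
the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

balls = ["black"] * 2 + ["white"] * 3 + ["red"] * 4

# Every ordered pair of distinct balls: 9 * 8 = 72 outcomes.
ordered_draws = list(permutations(range(len(balls)), 2))
favorable = sum(
    1 for first, second in ordered_draws
    if balls[first] == "red" and balls[second] == "white"
)

p_enumeration = Fraction(favorable, len(ordered_draws))
p_conditional = Fraction(4, 9) * Fraction(3, 8)   # P(A_3) * P(B_2 | A_3)
```

Twelve of the 72 ordered pairs are favorable, so both values equal 1/6.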

The validity of the above technique is not obvious. It is not 
immediately evident that the marginal probability $P(A_3)$ can be
computed by completely disregarding the second event, nor that the
conditional probability corresponds to the simple physical event 
described above. 

For a compound event consisting of two single events we need only
consider a 2 × 2 table. Let $A_1$ correspond to a success on the first
event, and $A_2$ correspond to a failure, and let $m_1$ be the number of
ways the first event can succeed, and $m_2$ be the number of ways it can
fail. Let $B_1$ and $B_2$ be similarly defined for the second event. Let
$m_{11}$ and $m_{12}$ be the numbers of ways the second event can succeed and
fail if the first succeeds, and let $m_{21}$ and $m_{22}$ be the numbers of ways
the second can succeed or fail if the first event fails. The 2 × 2 table
is

          |  $B_1$            $B_2$
  --------+------------------------------
  $A_1$   |  $m_1 m_{11}$     $m_1 m_{12}$
  $A_2$   |  $m_2 m_{21}$     $m_2 m_{22}$

The total number of possible outcomes is

$$n = m_1 m_{11} + m_1 m_{12} + m_2 m_{21} + m_2 m_{22}$$

The required probability is

$$P(A_1, B_1) = \frac{m_1 m_{11}}{n} \qquad (1)$$

The marginal probability $P(A_1)$ is

$$\frac{m_1 m_{11}}{n} + \frac{m_1 m_{12}}{n} = \frac{m_1 (m_{11} + m_{12})}{m_1 (m_{11} + m_{12}) + m_2 (m_{21} + m_{22})} \qquad (2)$$


Now the probability of a success on the first event without regard to 
the second is simply $m_1/(m_1 + m_2)$, which is not equal to the above
expression unless

$$m_{11} + m_{12} = m_{21} + m_{22}$$


i.e., unless the total number of outcomes for the second event is the 
same regardless of whether or not the first event is a success. The
conditional probability is $m_{11}/(m_{11} + m_{12})$ and gives the probability of a
success for the second event under the assumption that the first was a
success.

We might be inclined to conclude that the conditional-probability 
approach is correct only if the number of outcomes for the second event 
is independent of the outcome of the first event. Precisely the 
opposite is true. The correct probability is 


$$P(A_1, B_1) = \frac{m_1}{m_1 + m_2} \cdot \frac{m_{11}}{m_{11} + m_{12}} \qquad (3)$$

and not the value $m_1 m_{11}/n$ given in equation (1).

The value computed by the conditional approach is always correct,
while that computed by enumeration of outcomes is correct only if the
number of outcomes for the second event is independent of the outcome
of the first event.

A simple example will clarify the situation. Suppose a coin is tossed, 
and if a head appears, a black ball is placed in an urn, while if a tail 
appears, a black ball and a white ball are placed in the urn. Then a 
ball is drawn from the urn. If a head is tossed, the ball will neces- 
sarily be black. Using H, T, B, W to represent heads, tails, black, and 
white, the three possible outcomes of this sequence are HB, TB, TW. 


These three outcomes are clearly not equally likely. If the experiment
were repeated a number of times, we should expect the outcome HB
to occur twice as often as either of the other two. P(HB) = 1/2, not
1/3.
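
The coin-and-urn example can be worked through with the multiplicative
law, applying the definition of probability to each constituent event
separately. The sketch below is our own; the dictionary `urn_after` and
the other names are hypothetical, and the coin is assumed to be fair.

```python
from fractions import Fraction

p_toss = Fraction(1, 2)                       # fair coin assumed
urn_after = {"H": ["B"], "T": ["B", "W"]}     # urn contents after each toss

prob = {}
for toss, contents in urn_after.items():
    for ball in set(contents):
        # Conditional probability of the ball given the toss.
        p_ball_given_toss = Fraction(contents.count(ball), len(contents))
        prob[toss + ball] = p_toss * p_ball_given_toss
```

The three outcomes HB, TB, TW receive probabilities 1/2, 1/4, 1/4, so
that HB is twice as likely as either of the others, whereas a naive count
of outcomes would assign 1/3 to each.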

In general, the possible outcomes of a compound event are not
equally likely if the number of outcomes of the second event depends 
on the outcome of the first; hence the definition of probability is not 
applicable. However, if the definition can be applied to the constitu- 
ent events separately, then it is possible to compute the probability 
of the compound event by using the method of conditional probabili- 
ties. Unfortunately, it is not possible to give a formal proof of these 
statements. We must simply rely on our intuition, or rather on the 
import of whatever experimental evidence we may possess. Such
evidence may be obtained, for example, by performing the above- 
described experiment a number of times. 

Illustrative example: To illustrate further the method of conditional 
probabilities, let us compute the probability that of five cards drawn 
from an ordinary deck, exactly two will be aces. 

We shall suppose the deck consists of four A’s, representing aces, and 
48 N’s, representing not aces. To use conditional probabilities, we 
must assume the five cards are drawn one at a time, and we must 
assume a particular order such as A, A, N, N, N. We shall use
equation (8.9) with $h = 5$.

$$P(A, A, N, N, N) = P(A) P(A|A) P(N|A, A) P(N|A, A, N) P(N|A, A, N, N)$$

Now $P(A) = 4/52$; with one ace removed from the deck, $P(A|A) = 3/51$;
with two aces removed from the deck, $P(N|A, A) = 48/50$. Proceeding
thus,

$$P(A, A, N, N, N) = \frac{4}{52} \times \frac{3}{51} \times \frac{48}{50} \times \frac{47}{49} \times \frac{46}{48}$$


This is the probability for the given order, but the problem did not 
specify any order, so we must consider all possible orders. There are
$5!/(2!\,3!) = 10$ permutations of two A's and three N's, so we have
10 probabilities to evaluate, and the required probability, by the
additive law, is the sum of these 10 probabilities. It is soon apparent,
however, that all the probabilities are equal. Thus, for example,

$$P(N, A, N, N, A) = \frac{48}{52} \times \frac{4}{51} \times \frac{47}{50} \times \frac{46}{49} \times \frac{3}{48}$$

which is the same as the above number except that the numerators are
permuted. Clearly this will be the case for all permutations. Hence
the required probability is

$$10\,P(A, A, N, N, N) = \frac{10 \times 4 \times 3 \times 47 \times 46}{52 \times 51 \times 50 \times 49} = .0399$$
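
As a check on this result, the same probability may be obtained by the
combinatorial method of counting five-card hands. The sketch below is
ours and the variable names are hypothetical; it evaluates the
conditional-probability product in exact fractions and compares it with
the ratio of the number of hands having exactly two aces to the total
number of hands.

```python
from fractions import Fraction
from math import comb, factorial

# Probability of the particular order A, A, N, N, N without replacement.
p_one_order = (Fraction(4, 52) * Fraction(3, 51) * Fraction(48, 50)
               * Fraction(47, 49) * Fraction(46, 48))

# Number of distinct orders of two A's and three N's.
n_orders = factorial(5) // (factorial(2) * factorial(3))
p_conditional = n_orders * p_one_order

# Counting hands: choose 2 of the 4 aces and 3 of the 48 other cards.
p_counting = Fraction(comb(4, 2) * comb(48, 3), comb(52, 5))
```

The two fractions are identical, and both are approximately .0399.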


Independent Events. If the conditional probability $P(B|A)$ is equal
to the marginal probability $P(B)$, the events A and B are said to be
independent. The outcome of B is not influenced in any way by A.
Thus a die may be tossed twice, and we may seek the probability that
the results will be two and three in that order:

$$P(2, 3) = P(2) P(3|2) = P(2) P(3) = \frac{1}{6} \times \frac{1}{6}$$

In the illustrative example involving two aces in five cards, the five
constituent events of the compound event will be independent if we
require that each card drawn be replaced in the deck and the deck
shuffled before the next card is drawn. The probability that the
second card will be an ace is then 4/52 instead of 3/51. The probability
that two aces will appear when five cards are drawn with replacement
is

$$10 (4/52)^2 (48/52)^3 = .0465$$
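
The value for drawing with replacement can be confirmed by summing, with
the additive law, the probabilities of the ten orders in which two aces
and three other cards may appear. This is a sketch of our own with
hypothetical variable names.

```python
from fractions import Fraction
from itertools import product
from math import comb

p_ace = Fraction(4, 52)

# Closed form: ten orders, each with probability p^2 (1 - p)^3.
p_formula = comb(5, 2) * p_ace ** 2 * (1 - p_ace) ** 3

# Additive law: sum over every ace/non-ace pattern with exactly two aces.
p_sum = Fraction(0)
for pattern in product([True, False], repeat=5):
    if sum(pattern) == 2:
        term = Fraction(1)
        for is_ace in pattern:
            term *= p_ace if is_ace else 1 - p_ace   # independent draws
        p_sum += term
```

Both computations give 17280/371293, or about .0465.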

In general, 

If the constituent events of a compound event are mutually independent, 
the probability of the compound event is equal to the product of the probabil- 
ities of the constituent events. 

We may write this in the form 


$$P(A_1, A_2, \ldots, A_h) = \prod_{i=1}^{h} P(A_i) \qquad (4)$$

provided that

$$P(A_i) = P(A_i | A_j, \ldots, A_p) \quad \text{for all } i, j, \ldots, p$$


It is important to remember that this probability is the probability of 
occurrence of the separate events in a specific order. 

The additive law of probability given by equation (8.6) can also be 
used to simplify materially certain problems in compound events. A 
striking example is provided in the following: 

Illustrative example: Six cards are drawn with replacement from an 
ordinary deck. What is the probability that each of the four suits 
will be represented at least once among the six cards? 

We shall solve the problem by finding first the probability that all 
the suits do not appear. Let A symbolize the appearance of all the 
suits, and B symbolize the nonappearance of at least one of the suits. 

Since either A or B is certain to happen, 
$$P(A \text{ or } B) = 1$$

and since A and B are mutually exclusive,

$$P(A \text{ or } B) = P(A) + P(B) = 1$$

and

$$P(A) = 1 - P(B)$$


Thus, if we can find P(B), P(A) can be determined at once. 

To get P(B), let us classify the possible outcomes favorable to B into 
four sets: $B_1$ is the set of all outcomes in which spades are absent; $B_2$
is the set for which hearts are absent; $B_3$, diamonds absent; $B_4$, clubs
absent. These sets are overlapping; an outcome which consists of
only spades and hearts falls in $B_3$ and in $B_4$. Clearly

$$P(B) = P(B_1 \text{ or } B_2 \text{ or } B_3 \text{ or } B_4)$$

and employing equation (8.6)

$$P(B) = \sum P(B_i) - \sum P(B_i, B_j) + \sum P(B_i, B_j, B_k) - P(B_1, B_2, B_3, B_4)$$

in which the sums are taken over all combinations of the subscripts.
The probability $P(B_1)$ that a spade will not appear in the six draws is
$(3/4)^6$, and the value is the same for all $B_i$; hence

$$\sum P(B_i) = 4(3/4)^6$$

The probability $P(B_1, B_2)$ that neither spades nor hearts will appear
in the six draws is $(1/2)^6$ and is the same for all six pairs of the four suits
taken two at a time; hence

$$\sum P(B_i, B_j) = 6(1/2)^6$$

Similarly

$$\sum P(B_i, B_j, B_k) = 4(1/4)^6$$

and

$$P(B_1, B_2, B_3, B_4) = 0$$

since the simultaneous nonappearance of every suit is impossible. The
required probability is, therefore,

$$P(A) = 1 - 4(3/4)^6 + 6(1/2)^6 - 4(1/4)^6 = .381$$
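
Because the draws are made with replacement, each of the $4^6 = 4096$
sequences of suits is equally likely, and the result can be verified by
listing them. The sketch below is ours; the function name
`prob_all_suits` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_all_suits(n):
    """Probability that all four suits appear in n draws with replacement,
    from the additive law (8.6)."""
    return (1 - 4 * Fraction(3, 4) ** n + 6 * Fraction(1, 2) ** n
            - 4 * Fraction(1, 4) ** n)


# Direct count over every sequence of six suits (suits coded 0 to 3).
sequences = list(product(range(4), repeat=6))
with_all_four = sum(1 for seq in sequences if len(set(seq)) == 4)
p_by_counting = Fraction(with_all_four, len(sequences))
```

Of the 4096 sequences, 1560 contain every suit, in agreement with the
formula.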
A slight alteration of this example will illustrate another useful 
technique. 




Illustrative example: Cards are drawn one at a time with replacement 
from an ordinary deck until all suits have appeared at least once. 
What is the probability that six draws will be required? 

Referring to the preceding example, let $P_n$ denote the probability
that all suits will be represented at least once if $n$ cards are drawn.
Clearly

$$P_n = 1 - 4(3/4)^n + 6(1/2)^n - 4(1/4)^n$$

Now suppose we knew the answer to the present problem for a general
value of $n$. Let $p_n$ denote this probability (that exactly $n$ draws will
be required to produce all the suits).

If $n$ cards are drawn, the first appearance of each suit at least once
may occur on the fourth draw, or the fifth, or the sixth, and so forth.
Since these outcomes are mutually exclusive, we have

$$P_n = p_4 + p_5 + p_6 + \cdots + p_n$$

From this relation we conclude that

$$p_n = P_n - P_{n-1}$$

and in particular that

$$p_6 = 1 - 4(3/4)^6 + 6(1/2)^6 - 4(1/4)^6 - [1 - 4(3/4)^5 + 6(1/2)^5 - 4(1/4)^5]$$
$$= (3/4)^5 - 3(1/2)^5 + 3(1/4)^5$$
$$= 75/512 = .146$$
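
The difference $P_6 - P_5$ may be checked by counting the sequences of six
draws in which the fourth suit first appears on the sixth draw. This is
a sketch of our own; the function name is hypothetical, and the suits
are coded 0 to 3.

```python
from fractions import Fraction
from itertools import product


def prob_all_suits(n):
    """P_n: probability that all four suits appear in n draws."""
    return (1 - 4 * Fraction(3, 4) ** n + 6 * Fraction(1, 2) ** n
            - 4 * Fraction(1, 4) ** n)


p6 = prob_all_suits(6) - prob_all_suits(5)

# Sequences whose first five draws show exactly three suits and whose
# sixth draw supplies the missing one.
first_complete_on_sixth = sum(
    1 for seq in product(range(4), repeat=6)
    if len(set(seq[:5])) == 3 and len(set(seq)) == 4
)
p6_by_counting = Fraction(first_complete_on_sixth, 4 ** 6)
```

There are 600 such sequences among the 4096, giving the exact value
75/512, or about .1465.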

2.10. A Priori and Empirical Probabilities. In introducing the 
theory of probability we have relied heavily on the combinatorial 
definition given in the first section of the chapter. However, we have 
seen that this approach has severe limitations, and the question arises 
as to how useful such a theory may be. 

A theory of statistics based on a priori probability would indeed 
have very limited usefulness. While there are a few practical situa- 
tions in which such a theory could be used (the field of genetics provides 
one important area), the great majority of fields of application occur 
where a priori probabilities do not exist. Our theory must be general- 
ized, and we shall do it quite arbitrarily. We shall simply assume the 
existence of certain probabilities, and we shall assume that they obey 
the same laws as do combinatorial probabilities. We may consider 
the coin, mentioned earlier, which is known to be loaded in favor of 
heads. We shall assume that there is a number which gives the correct 
probability of a head, though one cannot say what the number is. 
We can, however, estimate the number. We may toss the coin a 
large number of times and divide the number of heads by the total
number of tosses. Thus if 62 heads appear in 100 tosses, we would
estimate the probability to be .62. This estimate is called an empirical
probability. We shall not make the error of stating that the correct
probability of a head is .62, because we know that if the coin were
tossed 100 times again, the number of heads might well differ from 62. 
The empirical probability is merely an estimate of what we think of 
as the true probability. We shall see later that the estimate can be 
made more and more accurate by increasing the number of trials in the 
experiment. 
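
The behavior of an empirical probability can be illustrated by
simulation. In the sketch below, which is ours, the "true" probability
of .6 for the loaded coin is a purely hypothetical value chosen for
illustration, as are the function name and the seed; in practice the
true value is exactly what we do not know.

```python
import random


def empirical_probability(p_true, tosses, seed=0):
    """Toss a simulated coin and return the observed fraction of heads."""
    rng = random.Random(seed)            # seeded so the run is repeatable
    heads = sum(rng.random() < p_true for _ in range(tosses))
    return heads / tosses


# Estimates from experiments of increasing length.
estimates = {n: empirical_probability(0.6, n) for n in (100, 10_000, 100_000)}
```

With 100 tosses the estimate may easily differ from .6 by several
hundredths; with 100,000 tosses it will ordinarily lie within a few
thousandths of it.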

We may observe that we do not need to postulate the existence of a 
probability for every imaginable situation. We may as well limit 
ourselves to operationally meaningful situations. That is, we shall 
not assume the existence of a probability unless it is possible to set up 
an experiment by means of which the assumed probability can be 
estimated. Referring to the question, mentioned in the first section, of
drawing an even number from the whole set of positive integers, we
do not need to assume that such a probability exists. For there is no
way to estimate it; we cannot build an urn large enough to hold
balls numbered 1, 2, 3, ... ad infinitum or even procure the balls.
Clearly this kind of limitation in the theory will not limit its practical 
application. 

Our position then is this: We develop the theory by thinking about 
ideal coins, ideal dice, ideal random drawings from an urn, and so 
forth. And we admit the existence of probabilities which have no 
a priori basis, provided they can be estimated. We speak of the 
probability of a head being one-half when a coin is tossed. But 
faced with an actual coin, we refuse to say what the probability of a 
head is. If the coin appears homogeneous and fairly symmetrical, we 
may guess that the probability is somewhere near one-half, but we 
shall not be surprised if a long series of trials indicates that the prob- 
ability is somewhere between .57 and .58, for example. We shall not 
hesitate to make statements of the following kind: whatever the prob- 
ability p may be, the probability of a tail is 1 — p, the probability of 
two heads when the coin is tossed twice is $p^2$, the probability of a head
and a tail in either order when the coin is tossed twice is $2p(1 - p)$, and
so forth. Thus, we shall use our laws of probability on $p$.

The justification for these assumptions (that noncombinatorial prob- 
abilities exist, and that they obey the same laws as combinatorial 
probabilities) is simply that they work. A great mass of experimental 
evidence supports the assumptions, while no evidence has ever been
brought forward which seriously controverts them.


2.11. Notes and References. The development of the theory of 
probability began in the seventeenth century and has continued stead- 
ily to the present day. It is therefore an old and now fairly extensive 
branch of applied mathematics. The subject had its origin in games 
of chance, but it brought forth such a variety of interesting problems 
that many eminent mathematicians were attracted to it. Today there
is likely more work being done in this field than ever before, and this is
due in large part to the rapid developments in statistics. 

An excellent modern textbook on probability theory is J. V. Uspensky,
"Introduction to Mathematical Probability," McGraw-Hill Book
Company, Inc., New York, 1937. 


2.12. Problems 


1. An urn contains three white balls and seven black ones. What 
is the probability that one drawn at random will be white? 
2. If two coins are tossed, what is the probability that a head and 
a tail will appear? 
3. If a three-volume set of books is placed on a shelf in random 
order, what is the probability that they will be in the correct order? 
4. What is the probability of obtaining three heads if three coins
are tossed? What is the probability that at least two heads will
appear?
5. An urn contains three white balls and two black ones. What is
the probability that two balls drawn from the urn will both be white?
6. How many three-digit numbers can be formed with the integers
1, 2, 3, 4, 5, if duplication of the integers is not allowed? If duplication
is allowed?
7. How many three-digit numbers can be formed from 0, 1, 2, 3, 4
if duplication is not allowed? How many of these are even?
8. In how many ways can a committee of three be chosen from
nine men?
9. There are five roads from A to B and six roads from B to C. In
how many ways can one go from A to C via B? 
10. How many different sums of money can be formed with one each 
of the six kinds of coins minted by the United States Treasury? 
11. In how many ways can six girls and four boys be divided into 
two groups of two boys and three girls? 
12. In a baseball league of eight teams, how many games will be 
necessary if each team is to play every other team twice at home? 
13. How many football teams can be formed with 12 men who can 
play any line position and 8 men who can play any back position? 


14. How many signals can a ship show with five different flags if 
there are five significant positions on the flagpole? 

15. How many license plates can be made if they are to contain five 
symbols, the first two being letters and the last three integers? 

16. How many diagonals are there in a twelve-sided polygon? 

17. How many dominoes are there in a set from double 0 to double 
12? 

18. What is the probability of getting a seven with a pair of dice? 

19. What is the probability that two cards drawn from an ordinary 
deck will be spades? 

20. What is the probability that a five-card hand will contain exact! y 
two aces? At least two aces? 

21. What is the probability that a bridge hand will be a complete 
suit? 

22. An urn contains four white, five red, and sixblack balls, Another 
contains five white, six red, and seven black balls. One ball is selected 
from each urn. What is the probability they will be of the same color? 

23. Show that $\binom{n}{x} = \binom{n}{n-x}$. 

24. In how many ways can n different objects be divided into k 
groups containing $n_1, n_2, \ldots, n_k$ objects, if 
$n_1 + n_2 + \cdots + n_k = n - m$? 

25. An urn contains m white and n black balls. X balls are drawn 
and laid aside, their color unnoticed. Then another ball is drawn. 
What is the probability that it is white? 

26. Six dice are tossed. What is the probability that every possible 
number will appear? 

27. Seven dice are tossed. What is the probability that every 
number appears? 

28. What is the probability of getting a total of five points with three 
dice? 

29. An urn contains ten balls numbered from one to ten. Four 
balls are drawn, and suppose x is the second smallest of the four 
numbers drawn. What is the probability that x = 3? 

30. If n balls are tossed into k boxes so that each ball is equally 
likely to fall in any box, what is the probability that a specified box will 
contain m balls? 


31. Show that $\sum_{i=1}^{n} cX_i = c \sum_{i=1}^{n} X_i$. 

32. Show that $\prod_{i=1}^{n} cX_i = c^n \prod_{i=1}^{n} X_i$. 

33. Show that $\left(\sum_{i=1}^{n} X_i\right)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} X_i X_j$. 

34. Show that $\prod_{i=1}^{2n+1} (X + n + 1 - i) = X \prod_{i=1}^{n} (X^2 - i^2)$. 

35. Find the coefficient of z*y? in the expansion of the binomial 
(x? — ay). 

36. Find the coefficient of z?y?z* in the expansion of the trinomial 
(2z — y — z)". 

37. If six balls are tossed into three boxes so that each is equally 
likely to fall in any box, what is the probability that all boxes will be 
occupied? 

38. The corners of a regular tetrahedron are numbered one, two, 
three, four. Five tetrahedra are tossed. What is the probability 
that the sum of the upturned corners will be 12? 

39. The spades and hearts are removed from a deck of cards and 
placed face up in a row. The remaining cards are shuffled and dealt 
face up in a row beneath the row of spades and hearts. What is the 
probability that all the clubs will be beneath spades? What is the 
probability that among the 26 pairs of cards, 16 pairs will consist of 
cards of the same color? 

40. Six cards are drawn from an ordinary deck. What is the prob- 
ability that there will be one pair (two aces, or two fives, for example) 
and four scattered cards? That there will be two pairs and two 
scattered cards? 

41. The face cards are removed from an ordinary deck and the 
remainder divided into the four suits. A card is drawn at random from 
each suit. What is the probability that the total of the four numbers 
drawn is 20? 

42. An urn contains three black balls, three white ones, and two 
red ones. Three balls are drawn and placed in a black box, then three 
more are drawn and placed in a white box, and the remaining two are 
put in a red box. What is the probability that all but two of the balls 
will fall in boxes corresponding to their colors? 

43. An urn contains four white and five black balls; a second urn 
contains five white and four black ones. One ball is transferred from 
the first to the second urn; then a ball is drawn from the second urn. 
What is the probability it is white? 

44. In the above problem suppose two balls, instead of one, are 
transferred from the first to the second urn. Find the probability that 
a ball then drawn from the second urn will be white. 

45. If it is known that at least two heads appeared when five coins 
were tossed, what is the probability that the exact number of heads 
was three? 

46. If a bridge player has seven spades, what is the probability that 
his partner has at least one spade? At least two spades? 

47. If a bridge player and his partner have eight spades between 
them, what is the probability that the other five spades are split 
three and two in the opposing hands? 

48. A bridge player and his partner hold all spades except K, 3, 2. 
What is the probability that they are split K and 3, 2 in the opposing 
hands? What is the probability that K or K, 2 or K, 3 or K, 3, 2, 
appears in a specified one of the two opposing hands? 

49. A person repeatedly casts a pair of dice. He wins if he casts an 
eight before he casts a seven. What is his probability of winning? 
NOTE: $1 + x + x^2 + x^3 + \cdots = 1/(1 - x)$, if $|x| < 1$. 

50. In a dice game a player casts a pair of dice twice. He wins if 
the two numbers thrown do not differ by more than two with the 
following exceptions: if he gets a 3 on the first throw, he must produce 
a 4 on the second throw; if he gets an 11 on the first throw, he 
must produce a 10 on the second throw. What is his probability of 
winning? 

51. The game of craps is played with two dice as follows: In a par- 
ticular game one person throws the dice. He wins on the first throw 
if he gets 7 or 11 points; he loses on the first throw if he gets 2, 3, or 
12 points. If he gets 4, 5, 6, 8, 9, or 10 points on the first throw, he 
continues to throw the dice repeatedly until he produces either a 7 
or the number first thrown; in the latter case he wins, in the former he 
loses. What is his probability of winning? 

52. In simple Mendelian inheritance, a physical characteristic of a 
plant or animal is determined by a single pair of genes. The color of 
peas is an example. Letting y and g represent yellow and green, peas 
will be green if the plant has the color-gene pair (g, g); they will be 
yellow if the color-gene pair is (y, y) or (y, g). In view of this last 
combination, yellow is said to be dominant to green. Progeny get one 
gene from each parent and are equally likely to get either gene from 
each parent’s pair. If (y, y) peas are crossed with (g, g) peas, all the 
resulting peas will be (y, g) and yellow because of dominance. If (y, g) 
peas are crossed with (g, g) peas, the probability is .5 that the resulting 
peas will be yellow and is .5 that they will be green. In a large number 
of such crosses one would expect about half the resulting peas to be 
yellow, the remainder to be green. In crosses between (y, g) and (y, g) 
peas, what proportion would be expected to be yellow? What pro- 
portion of the yellow peas would be expected to be (y, y)? 

53. Peas may be smooth or wrinkled, and this is a simple Mendelian 
character. Smooth is dominant to wrinkled so that (s, s) and (s, w) 
peas are smooth while (w, w) peas are wrinkled. If (y, g) (s, w) peas 
are crossed with (g, g) (w, w) peas, what are the possible outcomes and 
what are their associated probabilities? For the (y, g) (s, w) by 
(g, g) (s, w) cross? For the (y, g) (s, w) by (y, g) (s, w) cross? 

54. Albinism in human beings is a simple Mendelian character. 
Let a and n represent albino and nonalbino; the latter is dominant, so 
that normal parents cannot have an albino child unless both are (n, a). 
Suppose that in a large population the proportion of n genes is p and 
the proportion of a genes is q = 1 - p, so that q^2 of the individuals 
are albinos. Assuming that albinism is not a factor in the selection of 
marriage partners or in the number of children of a particular marriage, 
what proportion of individuals of the next generation would be 
expected to be albinos? If albinos married only albinos and had as 
many children on the average as nonalbinos, what proportion of indi- 
viduals in the next generation would be expected to be albinos? 
What would happen eventually to the population if albinos continued 
generation after generation to mate only with albinos (assume num- 
ber of individuals in each generation is the same)? 

55. It is known that an urn was filled by casting a die and putting 
white balls in the urn equal in number to that obtained on the throw 
of the die. Then black balls were added in a number determined by a 
second throw of the die. It is also known that the total number of 
balls in the urn is eight. What is the probability that the urn contains 
exactly five white balls? 

56. Urn A contains two white and two black balls; urn B contains 
three white and two black balls. One ball is transferred from A to B; 
one ball is then drawn from B and turns out to be white. What is 
the probability that the transferred ball was white? 

57. Each of six urns contains 12 black and white balls; one has 8 
white balls, two have 6 white balls, and three have 4 white balls. An 
urn is drawn at random, and three balls are drawn without replacement 
from that urn. Two of the three are white; the other is black. What 
is the probability that the urn drawn contained 6 white and 6 black 
balls? 

58. Three newspapers, A, B, C, are published in a certain city. It 
is estimated from a survey that of the adult population: 


    20% read A 
    16% read B 
    14% read C 
     8% read both A and B 
     5% read both A and C 
     4% read both B and C 
     2% read all three 

What percentage reads at least one of the papers? Of those that read 
at least one, what percentage reads both A and B? 

59. Twelve dice are cast. What is the probability that each of the 
six faces will appear at least once? 

60. A die is cast repeatedly until each of the six faces appears at 
least once. What is the probability that it must be cast ten times? 


CHAPTER 3 
DISCRETE DISTRIBUTIONS 


3.1. Introduction. In Chap. 2 we were concerned with finding the 
probability of a specific outcome for a certain chance event. In this 
chapter we shall be concerned with a complete set of probabilities. A 
simple example will introduce the idea. What is the probability that 
x heads will appear if four coins are tossed? Denoting the probability 
by f(x) (this is the functional notation): 


$$f(x) = \frac{\binom{4}{x}}{2^4}, \qquad 0 \le x \le 4 \qquad (1)$$

We have a function which tells us directly what the probability is 
for any value of x in its possible range, which is zero to four inclusive. 
The function gives the complete set of probabilities for the given char- 
acter (number of heads). We may calculate the function by giving x 
each of its possible values, and we may then plot the function, as in 
Fig. 2, using vertical lines of length equal to f(x) on some scale. Since 
one of the values of x is certain to occur, the sum of the set of probabili- 
ties must be one, because the probability of zero or one, or two, or 
three, or four heads, is equal to the sum of the separate probabilities. 


$$\sum_{x=0}^{4} f(x) = 1 \qquad (2)$$


The function f(x) is called a discrete probability density function, or 
distribution function. We shall usually refer to it more briefly as 
simply a density or a distribution. It is useful to think of f(x) as giving 
the relative frequency of occurrence of the separate values of x. Thus, 
suppose the four coins were tossed a very large number of times. We 
should expect no heads to appear (x = 0) in about one-sixteenth of the 
tosses; we should expect one head to appear (x = 1) in about one-fourth 
of the tosses, and so forth. The graph of the density makes a 
number of things immediately evident: that the most likely number of 
heads is two, that one head can be expected to occur about four times 
as often as no heads, that three heads can be expected to occur about 
as often as one head, and so forth. The word “about” is used because 
we are familiar with the fluctuations that accompany chance events. 
Thus, if a single coin is to be tossed ten times, we expect five heads 
and five tails on the average, but actually some other division of heads 
and tails is quite likely to occur in a given trial. 


Fig. 2. 


The results of an actual experiment in tossing four coins are given 
in the following table. Four coins were tossed 160 times and the 
number of heads counted on each toss. 

RESULTS OF TOSSING FOUR COINS 160 TIMES 

Number      Actual         Expected 
of heads    occurrences    occurrences 

   0            6              10 
   1           41              40 
   2           56              60 
   3           45              40 
   4           12              10 

 Total        160             160 


The agreement between actual and expected occurrences is none too 
good (it is to be remembered that the probability of a head may not 
have been exactly one-half for each of the four coins actually used), but 
still the general character of the distribution of actual outcomes was 
fairly well indicated by the distribution function f(x). 
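
The four-coin density and the expected occurrences in the table can be checked with a short computation. The sketch below is only an illustration in Python; the names `four_coin_density`, `density`, and `expected` are ours, not the text's, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from math import comb


def four_coin_density(x: int) -> Fraction:
    """f(x) = C(4, x) / 2**4 for x = 0, ..., 4, and zero elsewhere."""
    if not 0 <= x <= 4:
        return Fraction(0)
    return Fraction(comb(4, x), 2**4)


# The complete set of probabilities, one for each possible number of heads.
density = {x: four_coin_density(x) for x in range(5)}

# Equation (2): the probabilities must sum to one.
total = sum(density.values())

# Expected occurrences when the four coins are tossed 160 times.
expected = {x: 160 * fx for x, fx in density.items()}

for x in range(5):
    print(x, density[x], expected[x])
```

The expected column is simply 160 f(x); the actual occurrences come from the experiment and cannot be computed.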


Knowing the density function of some attribute x, we can supply 
the answer to any probability question pertaining to x. Thus, refer- 
ring again to our particular example, the probability of two heads is 


$$P(x = 2) = f(2) = \frac{\binom{4}{2}}{2^4} = \frac{3}{8}$$

The probability that the number of heads will be less than three is 


$$P(x < 3) = \sum_{x=0}^{2} f(x) = \frac{11}{16}$$


The probability that the number of heads will be between one and 
three inclusive is 


$$P(1 \le x \le 3) = \sum_{x=1}^{3} f(x) = \frac{7}{8}$$


Given that the number of heads on a specific outcome is less than four, 
the conditional probability that the number is not more than two is 

$$P(x \le 2 \mid x < 4) = \frac{\sum_{x=0}^{2} f(x)}{\sum_{x=0}^{3} f(x)} = \frac{11}{15}$$


The symbol P(· · ·) will always be used as it has been used here and 
may be read "the probability that . . . ." Thus in the last equation, 
the symbol represents this phrase: the probability that x is less than 
or equal to two given that x is less than four. A vertical bar used in 
the symbol will always mean “given that” or “when it is known that” 
and will precede the specified condition of a conditional probability. 
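
The probabilities of this section all come from summing f(x) over the values of x that satisfy a stated condition. A minimal sketch, assuming the four-coin density of Sec. 3.1, is given below; the helper `prob` and its `given` argument are illustrative names of our own.

```python
from fractions import Fraction
from math import comb

# Four-coin density: f(x) = C(4, x) / 16 for x = 0, ..., 4.
f = {x: Fraction(comb(4, x), 16) for x in range(5)}


def prob(event, given=lambda x: True):
    """P(event | given): sum f over points in both sets, divided by the sum over 'given'."""
    numerator = sum(p for x, p in f.items() if event(x) and given(x))
    denominator = sum(p for x, p in f.items() if given(x))
    return numerator / denominator


p_two = prob(lambda x: x == 2)                           # P(x = 2)
p_less_than_three = prob(lambda x: x < 3)                # P(x < 3)
p_one_to_three = prob(lambda x: 1 <= x <= 3)             # P(1 <= x <= 3)
p_conditional = prob(lambda x: x <= 2, lambda x: x < 4)  # P(x <= 2 | x < 4)

print(p_two, p_less_than_three, p_one_to_three, p_conditional)
```

With no condition supplied, the denominator is the sum of the whole density, which is one, so the same helper serves for unconditional probabilities.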

3.2. Discrete Density Functions. The essential properties of dis- 
crete density functions have already been suggested in the preceding 
section, and we need only to describe them in somewhat more general 
language. 

The set of possible outcomes of a chance event are classified into a 
number, say k, of mutually exclusive classes according to some attri- 
bute. Associated with each class is a value of a random variable, or 
variate, x. The density function is a function of x which gives the 
probability that any specified value of x will occur. 

The variate x may naturally describe the attribute, as was the case 
in the coin-tossing illustration, or it may simply be a code. Thus in 
drawing balls from an urn, the classification may be according to color. 
We could define a random variable x by arbitrarily setting a 
correspondence between values of x and colors: x = 1 corresponds to black; 
x = 2 corresponds to red; and so forth. When a red ball is drawn, the 
variate has the value two. 

The density function may be a mathematical expression involving x, 
as was the case in the preceding section, or it may be only a table of 
values. Thus if an urn contains three black, two red, and five white 
balls, we may code the colors 1, 2, 3, respectively, and find the 
probabilities .3, .2, and .5. We do not bother to construct a mathematical 
expression which will take on these values when x is put equal to 1, 2, 
and 3, but merely tabulate the function: 

    x    |  1    2    3 
    f(x) |  .3   .2   .5 


The word discrete is used to distinguish the variate from continuous 
variates, which will be discussed in the next chapter. A variate x is 
discrete if it can take on only isolated values, i.e., if successive possible 
values of x are separated on the x axis. The distinction will be brought 
out in more detail in the next chapter. 

The set of probabilities represented by a density function will always 
have a sum equal to one because we shall speak of a density only when 
(1) all the possible outcomes are included among the separate classes 
of outcomes, (2) the classes are mutually exclusive. 
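
A tabulated density of this kind is naturally held as a lookup table. The following sketch, with hypothetical helper names `tabulate_density` and `is_density`, builds the table for the urn with three black, two red, and five white balls and checks the two requirements just stated (under the assumption that each ball is equally likely to be drawn).

```python
from fractions import Fraction


def tabulate_density(counts: dict, codes: dict) -> dict:
    """Map each coded value x to its probability when every ball is equally likely."""
    total = sum(counts.values())
    return {codes[color]: Fraction(k, total) for color, k in counts.items()}


def is_density(table: dict) -> bool:
    """The classes are the table's keys; their probabilities must be >= 0 and sum to one."""
    return all(p >= 0 for p in table.values()) and sum(table.values()) == 1


counts = {"black": 3, "red": 2, "white": 5}
codes = {"black": 1, "red": 2, "white": 3}   # an arbitrary coding of the colors

table = tabulate_density(counts, codes)
print(table, is_density(table))
```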

3.3. Multivariate Distribution. When the outcome of a chance 
event can be characterized in more than one way, the probability 
density function is a function of more than one variable. Thus when 
a card is drawn from an ordinary deck, it may be characterized according 
to its suit and to its denomination. Let x = 1, 2, 3, 4 correspond 
to the suits in some order (say, spades, hearts, diamonds, clubs), and 
let y = 1, 2, 3, ..., 13 correspond to the denominations A, 2, ..., 
10, J, Q, K. The probability of drawing a particular card will be 
denoted by f(x, y) and clearly 

$$f(x, y) = \frac{1}{52}, \qquad 1 \le x \le 4, \quad 1 \le y \le 13 \qquad (1)$$


This function may be plotted over a plane as in Fig. 3; the probabili- 
ties are represented by vertical lines at the points (x, y) in the hori- 
zontal plane where the probabilities are defined. In this case, since 
the function is a constant, the lines are of equal height. 

Fig. 3. 

Fig. 4. 

To consider another example: Let four balls be drawn from an urn 
containing five black, six white, and seven red balls. Let x be the 
number of white balls drawn and y be the number of red balls drawn. 
The density is 

$$f(x, y) = \frac{\binom{6}{x}\binom{7}{y}\binom{5}{4-x-y}}{\binom{18}{4}}, \qquad 0 \le x + y \le 4 \qquad (2)$$


and its graph is shown in Fig. 4. In this example, we might consider 
defining a third random variable, z, to be the number of black balls 
drawn, and obtain a trivariate distribution. But z is exactly 
determined by x and y since z = 4 - x - y. No new information can be 
obtained by adding z to the set of random variables characterizing the 
outcomes, and, in fact, if z were included in the distribution function, 
the set of probabilities represented by that function, f(x, y, z), would 
be exactly the same set that we have already obtained using x and y. 

Fig. 5. 


A simpler example of functional dependence is that of tossing a coin, 
say four times. Let x be the number of heads and y be the number of 
tails. Since x + y must be equal to four, the variables are functionally 
dependent; knowing one, the other is exactly determined. The 
density is 

$$f(x, y) = \binom{4}{x}\left(\frac{1}{2}\right)^4, \qquad x + y = 4$$

and its graph is given in Fig. 5. It gives us no more information than 
the function used as an example in Sec. 1; the set of probabilities is 
exactly the same as before. 
We have used the terms dependent and independent in two entirely 
different connections. In Chap. 2 we defined two events to be 
independent if the conditional probability of one, given the other, was 
equal to the marginal probability of the first. We shall in the future 
refer to this kind of independence as independence in the probability 
sense. Returning to the urn example: x and y are functionally 
independent (since y is not uniquely determined when x is known), but they 
are dependent in the probability sense (as we shall see). 

In the urn example, the marginal density of x is found by applying 
the definition in Sec. 2.7 (i.e., Sec. 7 of Chap. 2), and is 

$$f(x) = \sum_{y=0}^{4-x} f(x, y) = \frac{\binom{6}{x}\binom{12}{4-x}}{\binom{18}{4}}, \qquad 0 \le x \le 4 \qquad (3)$$

The sum may be performed by means of an algebraic identity, but 
here it is simpler to consider the problem anew as one involving 6 white 
balls and 12 that are not white. Similarly the marginal density of y is 

$$f(y) = \sum_{x=0}^{4-y} f(x, y) = \frac{\binom{7}{y}\binom{11}{4-y}}{\binom{18}{4}}, \qquad 0 \le y \le 4 \qquad (4)$$
This function is plotted in Fig. 6. The height of the line at y = 0, 
which represents f(0), is equal to the sum of the lengths of the vertical 
lines along the x axis in Fig. 4; f(1) is the sum of the lengths of the 
vertical lines along the line y = 1 in Fig. 4, and so forth. 
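
The agreement between the summed form and the closed form of each marginal density can be confirmed numerically. The sketch below is an illustration only; the function names `f_xy`, `marginal_x`, and `marginal_y` are ours, and the urn is the one with five black, six white, and seven red balls from which four balls are drawn.

```python
from fractions import Fraction
from math import comb

DRAWS = 4
TOTAL_WAYS = comb(18, DRAWS)   # 18 balls in all: 5 black, 6 white, 7 red


def f_xy(x: int, y: int) -> Fraction:
    """Joint density of x white and y red balls among the four drawn."""
    z = DRAWS - x - y          # number of black balls, fixed by x and y
    if x < 0 or y < 0 or z < 0:
        return Fraction(0)
    return Fraction(comb(6, x) * comb(7, y) * comb(5, z), TOTAL_WAYS)


def marginal_x(x: int) -> Fraction:
    """Sum the joint density over every possible y."""
    return sum(f_xy(x, y) for y in range(DRAWS + 1))


def marginal_y(y: int) -> Fraction:
    """Sum the joint density over every possible x."""
    return sum(f_xy(x, y) for x in range(DRAWS + 1))


for v in range(DRAWS + 1):
    # Closed forms: 6 white vs. 12 non-white, and 7 red vs. 11 non-red.
    closed_x = Fraction(comb(6, v) * comb(12, DRAWS - v), TOTAL_WAYS)
    closed_y = Fraction(comb(7, v) * comb(11, DRAWS - v), TOTAL_WAYS)
    print(v, marginal_x(v) == closed_x, marginal_y(v) == closed_y)
```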
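The conditional density of x, given y, is defined exactly as in Sec. 2.7 
and is denoted by 

$$f(x \mid y) = \frac{f(x, y)}{f(y)} = \frac{\binom{6}{x}\binom{5}{4-x-y}}{\binom{11}{4-y}}, \qquad 0 \le x \le 4 - y \qquad (5)$$

Similarly 

$$f(y \mid x) = \frac{f(x, y)}{f(x)} = \frac{\binom{7}{y}\binom{5}{4-x-y}}{\binom{12}{4-x}}, \qquad 0 \le y \le 4 - x \qquad (6)$$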


If x were given some specific value, say x = 1, we could plot the density 
f(y|1) by giving y its successive values: 0, 1, 2, 3. The vertical lines 
would have the same relative heights as those along the line x = 1 in 
Fig. 4; their lengths would be increased by the factor 1/f(x) evaluated 
for x = 1 so that the sum of their lengths would be one. We observe 
that f(y|x) is not equal to the marginal distribution of y, so that y and x 
are not independent in the probability sense. Of course, the fact that 
f(y|x) involves x is sufficient evidence that the two variates are 
dependent in the probability sense. If, however, we had an example in which 
f(y|x) did not involve x, it would still be possible for the two variates 
to be dependent because the range of y might depend on x. If both 
f(y|x) and the range of y do not involve x, then the two variates will 
obviously be independent in the probability sense. 

Fig. 6. 
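
The rescaling by 1/f(x) and the resulting dependence can be seen numerically for x = 1. This is a sketch under the same urn assumptions as before (five black, six white, seven red, four drawn); `conditional_y_given_x` is an illustrative name, not notation from the text.

```python
from fractions import Fraction
from math import comb

TOTAL_WAYS = comb(18, 4)


def f_xy(x: int, y: int) -> Fraction:
    """Joint density of x white and y red balls in four draws."""
    z = 4 - x - y
    if x < 0 or y < 0 or z < 0:
        return Fraction(0)
    return Fraction(comb(6, x) * comb(7, y) * comb(5, z), TOTAL_WAYS)


def marginal_x(x: int) -> Fraction:
    return sum(f_xy(x, y) for y in range(5))


def marginal_y(y: int) -> Fraction:
    return sum(f_xy(x, y) for x in range(5))


def conditional_y_given_x(y: int, x: int) -> Fraction:
    """f(y | x): the joint density along the line x, rescaled by 1 / f(x)."""
    return f_xy(x, y) / marginal_x(x)


# Along x = 1 the possible values of y are 0, 1, 2, 3.
conditional = [conditional_y_given_x(y, 1) for y in range(4)]
marginal = [marginal_y(y) for y in range(4)]

print(conditional)   # sums to one after rescaling
print(marginal)      # differs from the conditional, so x and y are dependent
```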

As an example of a distribution involving several variates, suppose 
12 cards are drawn without replacement from an ordinary deck, and 
let $x_1$ be the number of aces, $x_2$ be the number of deuces, $x_3$ be the 
number of treys, and $x_4$ be the number of fours. The distribution of 
these variates is given by a function of four variates and is, in fact, 

$$f(x_1, x_2, x_3, x_4) = \frac{\binom{4}{x_1}\binom{4}{x_2}\binom{4}{x_3}\binom{4}{x_4}\binom{36}{12 - x_1 - x_2 - x_3 - x_4}}{\binom{52}{12}}$$

where the range of each variate is $0 \le x_i \le 4$ subject to the restriction 
that $\sum x_i \le 12$. There are a large number of marginal and conditional 
distributions associated with this distribution; a few examples are 


$$f(x_2, x_3) = \frac{\binom{4}{x_2}\binom{4}{x_3}\binom{44}{12 - x_2 - x_3}}{\binom{52}{12}}, \qquad 0 \le x_i \le 4$$

$$f(x_4) = \frac{\binom{4}{x_4}\binom{48}{12 - x_4}}{\binom{52}{12}}, \qquad 0 \le x_4 \le 4$$

$$f(x_2, x_4 \mid x_1, x_3) = \frac{\binom{4}{x_2}\binom{4}{x_4}\binom{36}{12 - x_1 - x_2 - x_3 - x_4}}{\binom{44}{12 - x_1 - x_3}}, \qquad 0 \le x_i \le 4, \quad x_2 + x_4 \le 12 - x_1 - x_3$$

the first two being marginal distributions and the third a conditional 
distribution. The distribution $f(x_1, x_2, x_3, x_4)$ itself may in this case 
be regarded as a marginal distribution of some more detailed 
distribution, for example, the six-variate distribution of $x_1, x_2, x_3, x_4, x_5, x_6$, 
where $x_5$ and $x_6$ are the numbers of fives and sixes that appear among 
the 12 cards drawn. 
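
A marginal distribution of the four-variate density is obtained by summing out the unwanted variates, and the result can be compared with the form found by arguing directly from the deck. The sketch below does this for the number of fours alone; the names `f4`, `marginal_fours`, and `closed_form_fours` are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

TOTAL_WAYS = comb(52, 12)


def f4(x1: int, x2: int, x3: int, x4: int) -> Fraction:
    """Density of the numbers of aces, deuces, treys, and fours among 12 cards."""
    rest = 12 - x1 - x2 - x3 - x4      # cards drawn from the other 36
    if rest < 0 or any(not 0 <= x <= 4 for x in (x1, x2, x3, x4)):
        return Fraction(0)
    ways = comb(4, x1) * comb(4, x2) * comb(4, x3) * comb(4, x4) * comb(36, rest)
    return Fraction(ways, TOTAL_WAYS)


def marginal_fours(x4: int) -> Fraction:
    """Sum the four-variate density over aces, deuces, and treys."""
    return sum(f4(x1, x2, x3, x4) for x1, x2, x3 in product(range(5), repeat=3))


def closed_form_fours(x4: int) -> Fraction:
    """Direct argument: 4 fours and 48 other cards."""
    return Fraction(comb(4, x4) * comb(48, 12 - x4), TOTAL_WAYS)


for x4 in range(5):
    print(x4, marginal_fours(x4) == closed_form_fours(x4))
```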

We cannot plot the four-variate distribution; in fact, we have used 
all three dimensions of conceptual space in plotting bivariate distribu- 
tions. This could have been avoided by using a different device; we 
might have used dots of different sizes rather than vertical lines and 
thus pictured the bivariate distributions in two dimensions. This 
method would not have given as clear a representation of the relative 
magnitude of the probabilities. Using the dots, we could get a pic- 
torial representation of a trivariate distribution, but for more than 
three variates no simple graphical representation is possible. 

The probability that random variables will fall in any region of their 
space is obtained by summing the density function over all points in 
the region. Suppose a bivariate density f(x, y) is defined for x = 0, 1, 
2, ..., r and y = 0, 1, 2, ..., s. The probability that x < 5 and 
y ≤ 3 is obtained by summing f(x, y) over the region defined by the 
inequalities (the rectangle in Fig. 7). 

$$P(x < 5,\; y \le 3) = \sum_{x=0}^{4} \sum_{y=0}^{3} f(x, y)$$

Fig. 7. 

The probability that the sum of x and y is less than 5 is equal to the 
sum of f(x, y) over all points within the triangle bounded by the line 
x + y = 5, 

$$\begin{aligned}
P(x + y < 5) &= f(0, 0) + f(1, 0) + f(2, 0) + f(3, 0) + f(4, 0) \\
&\quad + f(0, 1) + f(1, 1) + f(2, 1) + f(3, 1) \\
&\quad + f(0, 2) + f(1, 2) + f(2, 2) \\
&\quad + f(0, 3) + f(1, 3) \\
&\quad + f(0, 4) \\
&= \sum_{x=0}^{4} \sum_{y=0}^{4-x} f(x, y) = \sum_{y=0}^{4} \sum_{x=0}^{4-y} f(x, y)
\end{aligned}$$
Some other examples are 

$$P(x + y = 5) = \sum_{x=0}^{5} f(x, 5 - x)$$

$$P(x \le 2 \mid y = 3) = \sum_{x=0}^{2} f(x \mid 3)$$

$$P(x \le 2 \mid y \ge 3) = \frac{\sum_{x=0}^{2} \sum_{y=3}^{s} f(x, y)}{\sum_{x=0}^{r} \sum_{y=3}^{s} f(x, y)}$$

$$P(x + y = 2 \mid x^2 + y^2 \le 5) = \frac{f(0, 2) + f(1, 1) + f(2, 0)}{f(0, 0) + f(0, 1) + f(0, 2) + f(1, 0) + f(1, 1) + f(1, 2) + f(2, 0) + f(2, 1)}$$


For three variables, the regions may be troublesome to visualize, and 
for more than three variables, we must rely on the analytical descrip- 
tion of the region to determine the required sums. Some relatively 
easy examples are 


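$$P(x \le 3,\; y \le 4,\; 2 \le z \le 6) = \sum_{x=0}^{3} \sum_{y=0}^{4} \sum_{z=2}^{6} f(x, y, z)$$

$$P(x + y = 4 \mid z = 2) = \sum_{x=0}^{4} f(x, 4 - x \mid 2)$$

$$P(x + y + z \le 6) = \sum_{x=0}^{6} \sum_{y=0}^{6-x} \sum_{z=0}^{6-x-y} f(x, y, z)$$

$$P(x + y + z = 6) = \sum_{x=0}^{6} \sum_{y=0}^{6-x} f(x, y, 6 - x - y)$$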

3.4. The Binomial Distribution. The binomial distribution is prob- 
ably the most frequently used discrete distribution in applications of 
the theory of statistics. It is the distribution associated with repeated 
trials of the same event. Suppose we denote by p the probability of 
success of some event. The event may be the occurrence of a head 
when a coin is tossed, in which case p = 1/2; it may be the occurrence 
of a seven when two dice are cast, in which case p = 1/6; it may be the 
occurrence of at least two aces when five cards are drawn from an 
ordinary deck, in which case 

$$p = \frac{\binom{4}{2}\binom{48}{3} + \binom{4}{3}\binom{48}{2} + \binom{4}{4}\binom{48}{1}}{\binom{52}{5}}$$

Or more generally, p may represent the probability of occurrence of 
some actual event to which no numerical a priori probability can be 
assigned. 

Whatever the event, if the probability of its occurrence is p, the prob- 
ability of its nonoccurrence is 1 — p, since we cannot suppose that the 
event can both occur and not occur in a given trial. It will be con- 
venient to denote 1 — p by q, and in speaking of a given trial we shall 
say the probability of a success is p and the probability of a failure is q. 

Now suppose that n trials are made. We shall be concerned with the 
number of successes, x, that occur among the n trials. The variate x 
has the density 

$$f(x) = \binom{n}{x} p^x q^{n-x}, \qquad 0 \le x \le n \qquad (1)$$

since there are $\binom{n}{x}$ orders in which x successes and n - x failures can 
occur, while the probability for any particular order is $p^x q^{n-x}$. This 
distribution is the binomial distribution. It is a discrete distribution 
of one random variable, x. 

The function contains two other variables p and n (q is not counted 
because it is determined by p) of a different character. Their variation 
is between different binomial distributions; for a specific binomial dis- 
tribution, p and n must be given numerical values. Variables of this 
kind are called parameters. The function actually represents a two- 
parameter family of distributions, and a specific member of the family 
is given when p and n are given specific values. The parameter n is 
called a discrete parameter, since it can have only the isolated values 
1, 2, 3, ...; it would be meaningless to speak of, say, 2.53 trials. 
But p is a continuous parameter, since it can conceivably have any 
value in the range zero to one. Thus it is possible for p to be .5, say, 
in the case of a true coin, or possibly .5000037 in the case of a slightly 
biased coin. Any arbitrarily chosen number between zero and one 
is an allowable value of p. 
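
The two-parameter family can be written as a single function of x, n, and p, a specific member being chosen by fixing n and p. A minimal sketch follows; the name `binomial_density` is ours, and p is passed as an exact fraction only so that the results can be compared without rounding.

```python
from fractions import Fraction
from math import comb


def binomial_density(x: int, n: int, p: Fraction) -> Fraction:
    """f(x) = C(n, x) p^x q^(n - x) for 0 <= x <= n, with q = 1 - p."""
    if not 0 <= x <= n:
        return Fraction(0)
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


# n = 4, p = 1/2 reproduces the four-coin density of Sec. 3.1.
coins = [binomial_density(x, 4, Fraction(1, 2)) for x in range(5)]

# n = 3, p = 1/6: number of sevens in three casts of a pair of dice.
sevens = [binomial_density(x, 3, Fraction(1, 6)) for x in range(4)]

print(coins)
print(sevens, sum(sevens))
```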

Two particular binomial distributions are plotted in Fig. 8. In (a), 
p = .4 and n = 4; in (b), p = .8 and n = 3. In general, the binomial 
density will have a maximum value determined as follows: Let m 
be the integral part of the number (n + 1)p and let e be the fractional 
part. Thus if n = 7 and p = .3, we have m = 2 and e = .4. The 
largest value of f(x) occurs when x is put equal to m; m is called the 
modal value or simply the mode of x. To prove that this value of x 
does maximize f(x), let us assume for the moment that e is not zero, 
and let us form the ratio f(x + 1)/f(x). We wish to show that this 
ratio is less than one when x is greater than or equal to m, and greater 
than one when x is less than m. We are thinking of a situation like 
that illustrated in Fig. 9. 

Fig. 9. 

Now 

$$\frac{f(x+1)}{f(x)} = \frac{p}{q} \cdot \frac{n - x}{x + 1}$$


and if x is greater than or equal to m, then 

$$\frac{p}{q} \cdot \frac{n - x}{x + 1} \le \frac{p}{q} \cdot \frac{n - m}{m + 1}$$

On substituting (n + 1)p - e for m, the right-hand expression may be 
written 

$$\frac{p}{q} \cdot \frac{n - m}{m + 1} = \frac{(n + 1) - [(1 - e)/q]}{(n + 1) + [(1 - e)/p]}$$


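which is certainly less than one. If x is less than m - 1, 

$$\frac{p}{q} \cdot \frac{n - x}{x + 1} > \frac{p}{q} \cdot \frac{n - m + 1}{m} = \frac{p}{q} \cdot \frac{(n + 1)q + e}{(n + 1)p - e} = \frac{n + 1 + e/q}{n + 1 - e/p}$$

and is therefore greater than one. We have omitted the case 
x = m - 1; here 

$$\frac{f(x+1)}{f(x)} = \frac{p}{q} \cdot \frac{n - m + 1}{m} = \frac{(n + 1) + e/q}{(n + 1) - e/p}$$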


which is again greater than one if e is not zero. If e = 0, the ratio is 
equal to one, and f(m) = f(m - 1); there are two largest values of 
f(x) which are equal and which occur at x = m and at x = m - 1. 
This situation is illustrated in Fig. 8(a) where (n + 1)p = 2 is an 
exact integer, so that f(1) and f(2) are two equal maximum values of 
f(x). 
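
The rule for the mode can be checked by brute force against the density itself. The sketch below assumes 0 < p < 1 and uses exact fractions so that the tie in the case e = 0 is detected exactly; `mode_and_fraction` and `argmax_set` are illustrative names of our own.

```python
from fractions import Fraction
from math import comb, floor


def binomial_density(x: int, n: int, p: Fraction) -> Fraction:
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def mode_and_fraction(n: int, p: Fraction):
    """Integral part m and fractional part e of (n + 1)p."""
    t = (n + 1) * p
    m = floor(t)
    return m, t - m


def argmax_set(n: int, p: Fraction):
    """All values of x at which f(x) attains its largest value (found by direct search)."""
    values = [binomial_density(x, n, p) for x in range(n + 1)]
    top = max(values)
    return [x for x, v in enumerate(values) if v == top]


print(mode_and_fraction(7, Fraction(3, 10)), argmax_set(7, Fraction(3, 10)))
print(mode_and_fraction(4, Fraction(2, 5)), argmax_set(4, Fraction(2, 5)))     # e = 0: a tie
print(mode_and_fraction(100, Fraction(4, 5)), argmax_set(100, Fraction(4, 5)))
```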

For large values of n the appearance of the binomial distribution is 
generally like that of Fig. 9. In Fig. 8(b) the mode is at x = n when 
p = .8 and n = 3, but as n increases, the mode moves away from the 
extreme right end of the range; thus, if n = 100, we have 

$$101 \times .8 = 80.8$$

so that the mode is 80 and is well away from the extreme value of 
x = 100. 

The computation of binomial probabilities becomes troublesome 
when n is large. Approximate methods can be developed for computing 
$\binom{n}{x} p^x q^{n-x}$, but we shall omit these because the computation of 
single terms is rarely required. In most applications, partial sums are 
needed. Thus we may require the probability that x be greater than 
an integer a, 

$$P(x > a) = \sum_{x=a+1}^{n} f(x)$$

Methods of computing such sums will be given in Chaps. 7 and 11. 
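
For moderate n the partial sum can simply be accumulated term by term. The sketch below is a direct summation in exact arithmetic, not one of the approximate methods referred to above; the name `upper_tail` is ours.

```python
from fractions import Fraction
from math import comb


def binomial_density(x: int, n: int, p: Fraction) -> Fraction:
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def upper_tail(a: int, n: int, p: Fraction) -> Fraction:
    """P(x > a): the sum of f(x) for x = a + 1, ..., n."""
    return sum((binomial_density(x, n, p) for x in range(a + 1, n + 1)), Fraction(0))


# Four coins: probability of more than two heads.
print(upper_tail(2, 4, Fraction(1, 2)))

# One hundred trials with p = 4/5: probability of more than 85 successes.
print(float(upper_tail(85, 100, Fraction(4, 5))))
```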

3.5. The Multinomial Distribution. The multinomial distribution 
is associated with repeated trials of an event which can have more than 
two outcomes. Thus the outcome of tossing a die may be any one 
of the six numbers 1, 2, ..., 6. If the event refers to the appearance 
of aces when, say, seven cards are drawn, there are five possible out- 
comes: 0, 1, 2, 3, or 4 aces. 

In general, suppose there are k possible outcomes of a chance event, 
and let the probabilities of these outcomes be denoted by $p_1, p_2, \ldots, 
p_k$. Obviously we must have 

$$\sum_{i=1}^{k} p_i = 1 \qquad (1)$$

just as p + q = 1 in the binomial case. Suppose the event is repeated 
n times, and let $x_1$ be the number of times the outcome associated with 
$p_1$ occurs, let $x_2$ be the number of times the outcome associated with $p_2$ 
occurs, and so forth. The density for the random variables $x_1, x_2, 
\ldots, x_{k-1}$ is 

$$f(x_1, x_2, \ldots, x_{k-1}) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} p_i^{x_i} \qquad (2)$$


where the range of each $x_i$ is zero to n inclusive, subject to the 
restriction that $\sum_{i=1}^{k} x_i = n$. We have written the function as one involving 
only k - 1 of the $x_i$'s since only k - 1 of them are functionally 
independent; $x_k$ is exactly determined by the relation $\sum_{i=1}^{k} x_i = n$ when the 
$x_1, \ldots, x_{k-1}$ are specified. Thus this is a multivariate distribution 
involving k - 1 variates. The $x_k$ on the right-hand side of (2) is to be 
interpreted as merely a symbol for the expression 

$$n - x_1 - x_2 - \cdots - x_{k-1}$$

The expression (2) is a k-parameter family of distributions, the 
parameters being $n, p_1, p_2, \ldots, p_{k-1}$. The other variable $p_k$ is, like 
q in the binomial distribution, exactly determined by 

$$p_k = 1 - p_1 - p_2 - \cdots - p_{k-1}$$

A particular case of a multinomial distribution is obtained by putting, 
e.g., n = 3, k = 3, $p_1 = .2$, $p_2 = .3$ to get 

$$f(x_1, x_2) = \frac{3!}{x_1!\, x_2!\, (3 - x_1 - x_2)!} (.2)^{x_1} (.3)^{x_2} (.5)^{3 - x_1 - x_2}$$

This function is plotted in Fig. 10. 

Fig. 10. 


It may be shown by a direct generalization of the argument used in 
the preceding section that the maximum value of $f(x_1, x_2, \ldots, x_{k-1})$ 
occurs when the $x_i$ are put equal to $m_i$, the integral parts of $(n + 1)p_i$. 
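
The particular case n = 3, k = 3 is small enough to enumerate completely. The sketch below evaluates the multinomial density for every admissible pair of the first two counts; `multinomial_density` and `f` are illustrative names, and the probabilities .2, .3, .5 are entered as exact fractions.

```python
from fractions import Fraction
from math import factorial


def multinomial_density(xs, ps) -> Fraction:
    """n! / (x_1! ... x_k!) * p_1^x_1 ... p_k^x_k, with n = sum(xs) and all k outcomes listed."""
    coefficient = factorial(sum(xs))
    for x in xs:
        coefficient //= factorial(x)   # exact: each partial quotient is an integer
    probability = Fraction(1)
    for x, p in zip(xs, ps):
        probability *= p**x
    return coefficient * probability


N_TRIALS = 3
PS = (Fraction(1, 5), Fraction(3, 10), Fraction(1, 2))


def f(x1: int, x2: int) -> Fraction:
    """Density in the k - 1 = 2 functionally independent variates."""
    x3 = N_TRIALS - x1 - x2            # x_3 is only a symbol for n - x_1 - x_2
    if x1 < 0 or x2 < 0 or x3 < 0:
        return Fraction(0)
    return multinomial_density((x1, x2, x3), PS)


table = {(x1, x2): f(x1, x2) for x1 in range(4) for x2 in range(4 - x1)}
largest = max(table, key=table.get)

print(sum(table.values()), largest, table[largest])
```

In this case the largest value falls at (0, 1), with the third count equal to 2, which agrees with the integral parts of (n + 1)p_i = .8, 1.2, 2.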

3.6. The Poisson Distribution. The Poisson density is represented 
by the function 


$$f(x) = \frac{e^{-m} m^x}{x!}, \qquad x = 0, 1, 2, 3, \ldots \qquad (1)$$

which has an infinite range. Since the exponential $e^m$ has the series 
expansion 

$$e^m = \sum_{x=0}^{\infty} \frac{m^x}{x!}$$

it follows that 

$$\sum_{x=0}^{\infty} f(x) = 1$$

The distribution has useful application in situations where a large 
number of objects are distributed over a large area. To consider a 
concrete example, suppose a volume V of fluid contains a large number 
N of small organisms. It is assumed that the organisms have no social 
instincts, and that they are as likely to appear in any part of the fluid 
as in any other part with the same volume. Now suppose a drop of 
volume D is to be examined under a microscope, what is the probability 
that x organisms will be found in the drop? We assume that V is very 
much larger than D. Since the organisms are assumed to be dis- 
tributed throughout the fluid with uniform probability, it follows that 
the probability that any given one of them may be found in D is D/V. 
And since they are assumed to have no social instincts, the occurrence
of one in D has no effect on whether or not another occurs in D. The 
probability that x of them occur in D is therefore 


$$\binom{N}{x}\left(\frac{D}{V}\right)^{x}\left(1 - \frac{D}{V}\right)^{N-x} \qquad (2)$$


We are also assuming here that the organisms are so small that the 
question of crowding may be neglected; all N of them would occupy 
no appreciable part of the volume D. The Poisson density is an 
approximation to the above expression, which is simply a binomial 
density in which p = D/V is very small. 

The Poisson distribution is obtained by letting V and N become 
infinite in such a way that the density of organisms N /V = d remains 
constant. Rewriting (2) in the form 


$$\frac{N(N-1)(N-2)\cdots(N-x+1)}{x!\,N^{x}}\left(\frac{ND}{V}\right)^{x}\left(1 - \frac{ND}{NV}\right)^{N-x}$$

$$= \frac{\left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)\cdots\left(1 - \frac{x-1}{N}\right)(Dd)^{x}\left(1 - \frac{Dd}{N}\right)^{N-x}}{x!}$$

the limit as N becomes infinite is readily seen to be

$$\frac{e^{-Dd}(Dd)^{x}}{x!}$$

which is the same form as (1) if we put Dd = m. This derivation
shows that m is the average value of x, since D, the volume of the
portion examined, multiplied by the over-all density d gives the
average number expected in the volume D.
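
The limiting argument can be illustrated numerically. The sketch below holds D and d fixed (the values 2.0 and 1.5 are hypothetical, chosen only so that m = Dd = 3), lets N grow with V = N/d, and records the largest difference between the binomial expression (2) and the Poisson density (1). The function names are our own.

```python
from math import comb, exp, factorial

def binomial_pmf(x, N, p):
    """Probability that x of the N organisms fall in the drop."""
    return comb(N, x) * p**x * (1 - p)**(N - x)

def poisson_pmf(x, m):
    """Poisson density with mean m."""
    return exp(-m) * m**x / factorial(x)

D, d = 2.0, 1.5          # hypothetical drop volume and density of organisms
m = D * d                # Poisson parameter

def max_gap(N):
    """Largest pointwise gap between binomial and Poisson for x = 0, ..., 14."""
    V = N / d            # keep the density N/V fixed at d
    p = D / V
    return max(abs(binomial_pmf(x, N, p) - poisson_pmf(x, m)) for x in range(15))

gaps = [max_gap(N) for N in (10, 100, 1000, 10000)]
for N, g in zip((10, 100, 1000, 10000), gaps):
    print(N, g)
```

The gaps shrink as N increases, which is the sense in which the Poisson density approximates the binomial when p = D/V is very small.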

We have gone into some detail in discussing this distribution because 
it is often erroneously applied to data which do not fulfill the
assumptions required by the distribution. Thus it cannot be used, for
example, in studying the distribution of insect larvae over some large crop
area, because insects lay their eggs in clusters so that if one is found in 
a given small area, others are likely to be found there also. 

The Poisson density function is perhaps best thought of as an
approximation to the binomial density, $\binom{N}{x} p^{x} q^{N-x}$, when Np is
large relative to p and N is large relative to Np. It is particularly
useful when N is unknown.

3.7. Other Discrete Distributions. The hypergeometric distribution
is

$$f(x) = \frac{\binom{m}{x}\binom{n}{r-x}}{\binom{m+n}{r}} \qquad x = 0, 1, 2, \ldots, r \qquad (1)$$

Equation (3.3) gives a special example. Equation (3.2) is an example
of a bivariate hypergeometrical distribution.

The uniform distribution is

$$f(x) = \frac{1}{n} \qquad x = 1, 2, \ldots, n \qquad (2)$$


The casting of a die provides an example. 
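The negative binomial distribution is

$$f(x) = \binom{x+r-1}{r-1} p^{r} q^{x} \qquad x = 0, 1, 2, \ldots \qquad (3)$$

and Σf(x) = 1 since

$$\sum_{x=0}^{\infty} \binom{x+r-1}{r-1} q^{x} = \frac{1}{(1-q)^{r}} = \frac{1}{p^{r}}$$
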
An example is provided by letting p be the probability of success and 
q be the probability of failure of a given event. Let f(x) be the
probability that exactly x + r trials will be required to produce r
successes. The last trial must be a success, and its probability is p.
Among the other x + r - 1 trials there must be r - 1 successes, and the
probability of this is

$$\binom{x+r-1}{r-1} p^{r-1} q^{x}$$

The product of these two probabilities gives the desired probability,
f(x), and is the same as (3).
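
The argument above can be turned into a short computation. The sketch below builds f(x) as the product of the two probabilities just described (r - 1 successes among the first x + r - 1 trials, then a success on the last trial) and checks numerically that the values sum to one. The function name and the particular values r = 3, p = .4 are our own choices for illustration.

```python
from math import comb

def neg_binomial_pmf(x, r, p):
    """Probability that exactly x + r trials are needed to produce r successes."""
    q = 1.0 - p
    first_part = comb(x + r - 1, r - 1) * p**(r - 1) * q**x   # r-1 successes, x failures
    last_trial = p                                            # the final trial is a success
    return first_part * last_trial

r, p = 3, 0.4   # hypothetical values for the check
total = sum(neg_binomial_pmf(x, r, p) for x in range(2000))
print(total)
```

With r = 1 the same function reduces to the density of the number of failures before the first success, p q^x.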


3.8. Problems. Specify range of variates for every distribution. Do 
not obtain numerical answers which require lengthy computations. 


1. Five cards are dealt from an ordinary deck. What is the density 
function for the number of spades? 

2. Ten balls are tossed into four boxes so that each ball is equally 
likely to fall in any box. What is the density for the number of balls 
in the first box? 

3. A coin is tossed until a head appears. What is the density for 
the number of tosses? 

4. What is the density for the number that appears when a die is 
cast? 

5. Two dice are cast. What is the density of the sum of the two 
numbers which appear? 

6. Cards are drawn from an ordinary deck without replacement 
until a spade appears. What is the density for the number of draws? 

7. Ten dice are cast. What is the density of the number of ones 
and twos? 

8. An urn contains m black and n white balls. k balls are drawn 
without replacement. What is the density of the number of white 
balls? Specify the range for the various relative sizes of m, n, and k. 

9. Three coins are tossed n times. Find the joint density of x, 
the number of times no heads appear; y, the number of times one head 
appears; and z, the number of times two heads appear. 

10. A machine makes nails with an average of 1 per cent defective. 
What is the density of the number of defectives in a sample of 50 nails? 

11. An urn contains 10 white and 20 black balls. Balls are drawn 
one by one, without replacement, until 5 white ones have appeared. 
Find the density of the total number drawn. 

12. Seven cards are drawn without replacement from an ordinary
deck. Find the joint density of the number of aces and the number
of kings.

13. Show that

$$\sum_{i=0}^{r} \binom{m}{i}\binom{n}{r-i} = \binom{m+n}{r}$$

by equating coefficients of $x^{r}$ in

$$(1 + x)^{m}(1 + x)^{n} = (1 + x)^{m+n}$$

Hence verify algebraically that the sum of the hypergeometric density
is one.

14. Use the result of Prob. 13 to find the marginal density of the 
number of aces from the result of Prob. 12.

15. In a town with 5000 adults, a sample of 100 are asked their 
opinion of a proposed municipal project; 60 are found to favor it and 
40 to oppose it. If in fact the adults of the town were equally divided 
on the proposal, what would be the probability of obtaining a majority 
of 60 or more favoring it in a sample of 100? 

16. A distributor of bean seeds determines from extensive tests that 
5 per cent of a large batch of seeds will not germinate. He sells the 
seeds in packages of 200 and guarantees 90 per cent germination. 
What is the probability that a given package will violate the guarantee? 

17. A manufacturing process is intended to produce electrical fuses 
with no more than 1 per cent defective. It is checked every hour by 
trying 10 fuses selected at random from the hour’s production. If one 
or more of the 10 fails, the process is halted and carefully examined. 
If in fact its probability of producing a defective fuse is .01, what is 
the probability that the process will needlessly be examined in a given 
instance? 

18. Referring to the above problem, how many fuses (instead of 10) 
should be tested if the manufacturer desires that the probability be 
about 0.95 that the process will be examined when it is producing 
10 per cent defectives? 

19. A has two pennies; B has one. They match pennies until one 
of them has all three. What is the density of the number of trials 
required to end the game? 

20. Referring to the above problem, what is the density of the num- 
ber of trials given that A wins? 

21. A die is cast ten times. What is the probability that the number 
of ones and twos will not differ by more than two from its modal value? 

22. A Poisson distribution has a double mode at x = 1 and x = 2;
what is the probability that x will have one or the other of these two
values? 

23. Red-blood-cell deficiency may be determined by examining a 
specimen of the blood under a microscope. Suppose a certain small
fixed volume contains on the average 20 red cells for normal persons.
What is the probability that a specimen from a normal person will
contain less than 15 red cells? 

24. An insurance company finds that 0.005 per cent of the popula- 
tion dies from a certain kind of accident each year. What is the prob- 
ability that the company must pay off on more than 3 of 10,000 insured 
risks against such accidents in a given year? 

25. A telephone switchboard handles 600 calls on the average during 
a rush hour. The board can make a maximum of 20 connections per
minute. Use the Poisson distribution to estimate the probability 
that the board will be overtaxed during any given minute. 

26. A die is cast until a six appears. What is the probability that 
it must be cast more than ten times? 

27. Two dice are cast ten times. Let x be the number of times no
ones appear, and let y be the number of times two ones appear. What
is the probability that x and y will each be less than 3?

28. In Prob. 27 what is the probability that x + y will be 4? What
is the probability that x + y will be between 2 and 4 inclusive?

29. A die is cast twenty times. What is the probability that there 
will be at least twice as many ones and twos as there are threes? 


30. Ten cards are drawn without replacement from an ordinary deck.
What is the probability that the number of spades will exceed the
number of clubs? 

31. Suppose a neutron passing through plutonium is equally likely 
to release 1, 2, or 3 other neutrons, and suppose these second-generation 
neutrons are in turn each equally likely to release 1, 2, or 3 third- 
generation neutrons. What is the density of the number of third- 
generation neutrons? 

32. Using the density of Prob. 12, find the conditional density of the 
number x of aces, given the number y of kings.

33. Using the density of Prob. 9, find the conditional density of x 
and z, given y. 


Determine the sums required to compute the following probabilities 
using density functions with as many variates as needed. Assume all 
variates take the values: 0, 1, 2, ..., m.


34. $P(2x + y \le 3)$
35. $P(x^{2} + y^{2} = 25)$
36. $P(x^{2} < 5 \mid 1 \le y < 6)$
37. $P(x > 2y - a)$, 0 < a < m
38. $P(x > y > z)$
39. $P(x + y = 5 \mid y = 3)$
40. $P(x + y = 5 \mid z = 3)$
41. $P(x < 3,\ y < 4,\ z > 5,\ w > 6)$
42. $P(a < x \le b,\ y = 2z)$, 0 < a < b < m
43. $P(x > 2y \mid x > z)$



CHAPTER 4 
DISTRIBUTIONS FOR CONTINUOUS VARIATES 


4.1. Continuous Variates. A continuous variate is one that is not 
restricted to have only isolated values; it may have any value in a 
certain interval or collection of intervals. 

To consider an example, suppose a rifle is perfectly aimed at the
center of a square target and fired several times after being clamped
in that position. The bullets will not all strike the center, because
minor variations in the weight of the bullets, shape of the bullets, in
the effect of humidity and temperature on the powder, and other
factors, will cause variations in the trajectories of the bullets.
After a few shots the appearance of the target might be represented by
Fig. 11. Let a random variable x be defined as the horizontal deviation
of the center of a hit from a vertical line through the center of the
target. Clearly x may have any value in its possible range of variation.

Fig. 11.

The number of possible values of x is infinite. In fact, any finite 
interval, however small, contains an infinite number of points. The 
interval .001 to .002, for example, contains among others the points 
.0011, .00111, .001111, .0011111, and so on. This fact raises some
difficulties about defining the probability of x. In order to understand
the problem, we must digress briefly to consider the number of points 
in an interval. 

The number of positive integers is infinite; it is called a denumerable
infinity. The symbol $A_0$ will be used to denote a denumerable infinity.
Any set of objects which can be put into one-to-one correspondence
with the positive integers will be said to contain $A_0$ objects. Thus
the set of even integers contains $A_0$ elements, for we can set up the
correspondence

    2, 4, 6, 8, 10, ..., 2n, ...
    1, 2, 3, 4,  5, ...,  n, ...



The set of numbers .5, 1, 1.5, 2, 2.5, . . . also has $A_0$ elements, since
we can set up the correspondence

    1/2, 2/2, 3/2, 4/2, 5/2, ..., n/2, ...
     1,   2,   3,   4,   5,  ...,  n,  ...


The set of unreduced proper fractions is also denumerable, since we
may set up the correspondence

    1/2, 1/3, 2/3, 1/4, 2/4, 3/4, 1/5, 2/5, 3/5, 4/5, ..., j/(r + 1), ...
     1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  ...,     n,     ...

where r is the largest integer for which r(r - 1)/2 < n and

$$j = n - \frac{r(r-1)}{2}$$

Thus for n = 9, we have r = 4, j = 3. 
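
The counting rule can be tried out directly. The sketch below assumes that the n-th fraction in the sequence is j/(r + 1), with r the largest integer for which r(r - 1)/2 < n and j = n - r(r - 1)/2; this reading agrees with the worked case n = 9, r = 4, j = 3. The function name is our own.

```python
def nth_unreduced_fraction(n):
    """Return (numerator, denominator) of the n-th unreduced proper fraction."""
    r = 1
    # Advance r while the next integer still satisfies (r+1)r/2 < n,
    # so that r ends as the largest integer with r(r-1)/2 < n.
    while (r + 1) * r // 2 < n:
        r += 1
    j = n - r * (r - 1) // 2
    return j, r + 1

first_ten = [nth_unreduced_fraction(n) for n in range(1, 11)]
print(first_ten)
```

Fractions are kept as pairs rather than reduced, since the correspondence counts 2/4 and 1/2 as separate entries.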

This last example shows that the number of rational numbers
(fractions) on the interval zero to one is at most a denumerable set.
Actually, in our sequence, every reduced fraction is counted $A_0$ times.
Thus 2/3, for example, appears as

    2/3, 4/6, 6/9, 8/12, ..., 2n/3n, ...

which is obviously a denumerable set. In the theory of sets, it is 
shown that every infinite subset of a denumerable set is also denumer- 
able. This theorem together with our last example shows that the 
number of rational points on the interval zero to one is a denumerable 
set. It can also be shown that the number of rational points on the 
whole x axis is denumerable. 

The total number of points on a finite interval, say the interval from 
zero to one on the x axis, is called a continuous infinity. This infinity 
is very much larger than a denumerable infinity and will be denoted by
$A_1$. We shall not prove that $A_1$ is larger than $A_0$, but it becomes
reasonable when we attempt to count the points on the unit interval.
Every point on the unit interval may be represented by an infinite
decimal. Thus the point 1/3 may be represented by

    .3333 ...

and 1/4 may be represented by

    .2500000 ...   or by   .2499999 ...



Conversely every infinite decimal corresponds to a distinct point on 
the unit interval. We can count the number of possible decimal 
expansions as follows: The first place can be filled in 10 ways, the 
second in 10 ways, the third in 10 ways, and so forth. The first n 
places can therefore be filled in $10^{n}$ different ways. The number of
infinite decimal sequences is therefore $10^{A_0}$, since there are $A_0$ places in
the sequence. When we compare $10^{5}$ with 5, $10^{20}$ with 20, $10^{1000}$ with
1000, it becomes reasonable to suppose that $10^{A_0}$ is of an entirely
different order from $A_0$. This number, $10^{A_0}$, is $A_1$. Actually there
are more decimal expansions than points, because of certain
duplications, as illustrated above for the point at 1/4, but these
duplications are denumerable and may be neglected relative to $A_1$. Any
finite number n greater than one raised to the power $A_0$ can be shown to
be equal to any other raised to that power. Since the number of points
on the unit interval, $A_1$, satisfies the relation

$$9^{A_0} \le A_1 \le 10^{A_0}$$

it follows that $A_1 = 10^{A_0}$ since $9^{A_0} = 10^{A_0}$. The equality sign here is
used to mean one-to-one correspondence.

Fig. 12.

We can easily show that the number of points on the whole x axis
is $A_1$. We may set up a correspondence by means of the function


1 b 
RR 20 
Wc iz 

1 RA 
e ieee < 
1 gm ifa <0 


which is plotted in Fig. 12. Corresponding to every value of x there
is a unique value of y between zero and one, and conversely there is a
unique value of x for every value of y between zero and one. Thus we
have a one-to-one correspondence between the points on the infinite
x axis and the unit interval on the y axis. The number of points on
the x axis is therefore $A_1$. It can also be shown that the number of
points in any finite interval, however large or small, is $A_1$. The
correspondence is set up as in Fig. 13. Let I and J be any two intervals
of different lengths, and let P be the point of intersection of two lines
joining their end points as illustrated. Any point x of I is made to
correspond to the point y of J which lies on the line joining x and P.
Thus any interval can be related to the unit interval.

Even more bizarre results than these could be obtained by pursuing 
the theory of sets further. Thus, for example, the number of points 
in a finite or infinite plane is also $A_1$. But we have enough results for
our immediate purposes. The important idea is the distinction
between the two infinities: denumerable and continuous. There are
a denumerable infinity of rational points in any interval, but the total
number of points is $A_1$, and the number of rational points is entirely
negligible relative to the total number. We could remove all the
rational points and essentially the whole interval would still remain.

Fig. 13.

We can now distinguish precisely between discrete and continuous
variates. A discrete variate is one which can take on a finite number
of values or a denumerable infinity of values. A continuous variate 
is one which can take on a continuous infinity of values. 

4.2. Probability Functions for Continuous Variates. In the case of 
discrete variates it is possible to have a finite probability associated 
with each admissible point, even when the number of points is infinite, 
and yet have the sum of the probabilities equal to one. Thus if x is
the number of tosses required to obtain a head with a coin, we have
seen that the density of x is

$$f(x) = \left(\tfrac{1}{2}\right)^{x} \qquad x = 1, 2, 3, 4, \ldots$$

and

$$\sum_{x=1}^{\infty} f(x) = 1$$


In the case of a continuous variate this is not possible. No matter 
how rapidly we try to make the probabilities converge to zero, their 
sum will nevertheless be infinite unless practically all the points (all 
but a denumerable set) are given probability zero. Referring back 
to the horizontal deviations of rifle shots on a target, it is clear that all 
values of x within a small interval will be about equally likely, and it 
cannot reasonably be assumed that most of these points have probabil- 
ity zero while some few others have finite probabilities. 

We have encountered a difficulty which, it is to be pointed out, is 
purely logical. From a practical point of view the difficulty is obscured 
by the fact that we could not actually distinguish between a deviation 
of .5 inch and one of .500003 inch. We are limited by the accuracy 
of whatever measuring device we use, and a deviation can be identi- 
fied only within a certain interval. Thus if we can measure only to 
within a hundredth of an inch, we might measure a deviation to be 
4.26 inches. This would be interpreted to mean that the deviation 
lies somewhere in the interval 4.25 to 4.27 inches and might better be 
written 4.26 ± .01 to indicate this fact.

The logical problem is met by dealing with intervals rather than 
individual points. Let us first examine some empirical probabilities 
for intervals. Suppose the rifle is fired 100 times at the target of 
Fig. 11, and suppose the target area is divided into strips by drawing 
vertical lines on it 1 inch apart. Letting the deviations x be negative 
to the left of the central line, suppose the vertical lines are drawn at 
x = ±1, ±2, ±3, and so on. Now for a given strip, say the one
with 0 < x < 1, the number of shots in that strip divided by 100 will 
be the empirical probability that a deviation will be between zero and 
one. We may tabulate a hypothetical distribution of shots and compute
the empirical probabilities as in the accompanying table.

    Strip            Number of shots    Empirical probability
    -5 < x < -4             1                   .01
    -4 < x < -3             1                   .01
    -3 < x < -2             6                   .06
    -2 < x < -1            13                   .13
    -1 < x <  0            24                   .24
     0 < x <  1            27                   .27
     1 < x <  2            16                   .16
     2 < x <  3             7                   .07
     3 < x <  4             3                   .03
     4 < x <  5             2                   .02

The empirical distribution represented by this table could be plotted by
using vertical lines as was done with discrete distributions. However, 
we shall not plot a line at say the mid-point of each interval but shall 
prefer to use a rectangle with height equal to the probability divided 
by the width of the interval, and with a width equal to the width of 
the interval. This is done to indicate that the probability refers to 
the whole interval rather than to any single point in the interval. 
The result is shown in Fig. 14. 

Fig. 14.

Referring to Fig. 14, we note that the area of one of the rectangles is
equal to the empirical probability for the interval corresponding to it,
since the height of the rectangle is equal to the probability and the
base is one. We shall focus attention on the areas rather than the
heights. The sum of the areas of all the rectangles is one. For
intervals other than those chosen originally, we may also estimate
probabilities. Thus we would estimate the probability that 0 < x < 2
by adding the areas of the two rectangles over that interval to get
.43. To estimate the probability that, say, -.25 < x < 1.5, we
would compute the area over that interval to get

    .06 + .27 + .08 = .41
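
The area computation can be sketched in a few lines of Python. The shot counts are those of the hypothetical table of 100 shots in 1-inch strips; the dictionary `shots` and the function `area` are our own names.

```python
# Left edge of each 1-inch strip -> number of shots falling in that strip.
shots = {-5: 1, -4: 1, -3: 6, -2: 13, -1: 24, 0: 27, 1: 16, 2: 7, 3: 3, 4: 2}
total = sum(shots.values())
width = 1.0

def area(a, b):
    """Estimate P(a < x < b) as the area of the rectangles lying over (a, b)."""
    prob = 0.0
    for left, count in shots.items():
        height = (count / total) / width                 # probability divided by width
        overlap = max(0.0, min(b, left + width) - max(a, left))
        prob += height * overlap
    return prob

print(area(0, 2))         # two whole rectangles
print(area(-0.25, 1.5))   # a quarter strip, a whole strip, and a half strip
```

A strip that is only partly covered contributes in proportion to the covered width, which is why attention is fixed on areas rather than heights.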


If a second 100 shots were fired at the target, we could obtain 
another empirical distribution, which would in all likelihood be differ- 
ent from the first though its general appearance might be similar. In 
constructing a theory of probability, we like to think of these empirical 
probabilities as being estimates of some "true" probability. To this
end we assume the existence of a curve f(x) such as that plotted in
Fig. 15. We may not be able to specify the function, but we assume
that there is some function which will give the correct probability for
any interval. The probabilities are given by areas under the curve,
not by values of the function. Thus

$$P(0 < x < 1) = \int_{0}^{1} f(x)\,dx$$

and this is the probability that is estimated by the area of the rectangle
over the interval 0 < x < 1 in Fig. 14.

The function f(x) is thought of as a smooth curve rather than a step
function for the following reasons: In the first place it is recognized
that the choice of intervals in any actual experiment is purely arbitrary.
In the rifle experiment we could just as well have used intervals 1/2 inch
long, or intervals with end points at 1.2, 2.2, 3.2, for example, or we
could have used intervals of different lengths: 0 to .5, .5 to 1.5, 1.5 to
3, for example. So the steps of the empirical distribution have no
particular significance. In the second place, suppose we consider two
small intervals at a division point, say 1.9 < x < 2 and 2 < x < 2.1.
Since the second interval is farther removed from center than the first,
we should expect its probability to be somewhat smaller, but it is not
reasonable to suppose a deviation is more than twice as likely to appear
in the first interval, as is indicated in Fig. 14. The smooth curve
gives a more reasonable relation between the two probabilities. In
the third place, experiments with a large number of trials usually
indicate that there are no abrupt changes in the distribution curve. Thus
if the rifle were fired, say, 1000 times, and if intervals 1/10 inch wide
were used, the steps would likely be much smaller than those of Fig. 14
and approximate a smooth curve.

Fig. 15.

In general, a probability density function for a continuous variate
will be a function f(x) defined over the range of the variate, and the
range may be finite or infinite. It is often convenient to think of the
variate as always having an infinite range; when the range is actually
finite, f(x) may be defined to be zero outside the range. The function
must be positive or zero, and the area under the curve must be one.
Symbolically, the requirements for a density function are

    (a)  $f(x) \ge 0$
    (b)  $\int_{-\infty}^{\infty} f(x)\,dx = 1$


The probability that the variate x falls in any interval a < x < b is
given by the integral

$$P(a < x < b) = \int_{a}^{b} f(x)\,dx$$


Since the area over a point is zero (a geometric line has no area), it is 
customary to define the probability that x has any particular value to
be zero. We may, in fact, argue that the probability is zero as follows:
To compute the probability that x will be some number a, let us find
the probability for a small interval of width 2c about a:

$$P(a - c < x < a + c) = \int_{a-c}^{a+c} f(x)\,dx$$

The integral is equal to 2cf(a') where a' is a properly chosen point in
the interval a - c to a + c. (A point a' is determined by constructing
a rectangle of area $\int_{a-c}^{a+c} f(x)\,dx$ over the interval. The top side
of the rectangle will intersect the curve f(x) at one or more points if
the curve is continuous, as we suppose it is. Any one of these points
may be chosen as a'. a' is obviously dependent on c and will approach
a as c approaches zero.) Now we shall let c approach zero and define

$$P(x = a) = \lim_{c \to 0} P(a - c < x < a + c) = \lim_{c \to 0} 2c f(a') = 0$$


We have defined an interval by the expression a < x < b, but we 
could equally well have used a ≤ x ≤ b or a ≤ x < b or a < x ≤ b
without changing the probability associated with the interval. A 
matter of one or two points does not change the probability for a 
continuous variate because the probability associated with a single 
point is zero. In fact, a denumerable set of points could be omitted 
from the interval without affecting the probability associated with it. 

In specific ideal situations, we may be able to say what the exact 
function f(x) is, just as we did in dealing with a priori probabilities. 
But in practical situations, f(x) will ordinarily be unknown. 



Any positive function over any arbitrarily chosen range may be 
regarded as a density function for some hypothetical variate over that 
range, provided the function is multiplied by a constant which will 
make the integral of the function over the range equal to one. Thus, 
3 + 2x, for example, may be made a density function over the range 
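2 < x < 4. Since

$$\int_{2}^{4} (3 + 2x)\,dx = 18$$

the following function is a density function:

$$f(x) = \begin{cases} 0 & x < 2 \\ \tfrac{1}{18}(3 + 2x) & 2 < x < 4 \\ 0 & x > 4 \end{cases}$$

Fig. 16.

The function is obviously positive or zero, and

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{2} 0\,dx + \int_{2}^{4} \tfrac{1}{18}(3 + 2x)\,dx + \int_{4}^{\infty} 0\,dx = 0 + 1 + 0 = 1$$

The probability that a variate having this density will fall in the
interval 2 < x < 3, for example, is

$$P(2 < x < 3) = \int_{2}^{3} \tfrac{1}{18}(3 + 2x)\,dx = \tfrac{4}{9}$$

The function is plotted in Fig. 16.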
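
The two requirements for a density, and the argument that the probability at a single point is zero, can both be checked numerically for this example. The sketch below uses a simple midpoint rule; the helper `integrate` and the choice of the point a = 3 are our own and serve only as an illustration.

```python
def f(x):
    """Density (3 + 2x)/18 on 2 < x < 4, zero elsewhere."""
    return (3 + 2 * x) / 18 if 2 < x < 4 else 0.0

def integrate(g, a, b, steps=20_000):
    """Midpoint-rule approximation to the integral of g from a to b."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

total = integrate(f, 2, 4)        # area under the curve; should be one
p_2_3 = integrate(f, 2, 3)        # P(2 < x < 3); should be 4/9

# P(a - c < x < a + c) about a = 3 for shrinking c; the values tend to zero.
shrinking = [integrate(f, 3 - c, 3 + c) for c in (0.1, 0.01, 0.001)]
print(total, p_2_3, shrinking)
```

Because this f is linear on its range, each shrinking probability equals 2c f(3) = c, so the approach to zero is easy to follow.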


4.3. Multivariate Distributions. Going back to the rifle experiment,
we may characterize each shot not only by its horizontal deviation
x but by its vertical deviation y measured perpendicularly from a
horizontal line through the center of the target. Suppose a large
number of shots are fired, and suppose the target is divided into 1-inch
squares by means of horizontal and vertical lines 1 inch apart. We
could count the number of hits in each square and compute an empirical
probability for each square. By plotting columns with heights
equal to the empirical probabilities over each square, we might get a
result like that illustrated in Fig. 17. The volume of a column
estimates the probability that a shot will fall in the square over which the
column is constructed.

Fig. 17.

We shall naturally idealize this situation by postulating the existence 
of a function f(x, y) which would plot as a smooth surface over the 
x, y plane. The probability that a shot falls in a given region is repre- 
sented by the volume under the surface over that region. One 
quarter of such a surface is illustrated in Fig. 18. The probability 
that x and y fall in the rectangular region 0 < x < a, 0 < y < b
illustrated in the figure is

$$P(0 < x < a,\ 0 < y < b) = \int_{0}^{a}\int_{0}^{b} f(x, y)\,dy\,dx \qquad (1)$$



As in the case of one variable, we require 


$$f(x, y) \ge 0 \qquad (2)$$

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1 \qquad (3)$$

The function f(x, y) is called the joint density function for x and y.

Fig. 18.

As an illustration, the function 6 — x — y is positive over the 
rectangle 0 < x < 2, 2 < y < 4, for example; hence it may be used 
to define a joint density function over that region. Since 


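$$\int_{0}^{2}\int_{2}^{4} (6 - x - y)\,dy\,dx = 8$$

the following is a density function:

$$f(x, y) = \begin{cases} \tfrac{1}{8}(6 - x - y) & 0 < x < 2,\ 2 < y < 4 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$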


If x and y are random variables having this density, the probability
that they will fall in the region x < 1, y < 3, for example, is

$$P(x < 1,\ y < 3) = \int_{-\infty}^{1}\int_{-\infty}^{3} f(x, y)\,dy\,dx = \int_{0}^{1}\int_{2}^{3} \tfrac{1}{8}(6 - x - y)\,dy\,dx = \tfrac{3}{8}$$



The probability that x + y will be less than three is

$$P(x + y < 3) = \int_{0}^{1}\int_{2}^{3-x} \tfrac{1}{8}(6 - x - y)\,dy\,dx = \tfrac{5}{24}$$

The probability that x < 1 when it is known that y < 3 is

$$P(x < 1 \mid y < 3) = \frac{P(x < 1,\ y < 3)}{P(y < 3)}$$

We have already computed the numerator of this expression, and the
denominator is

$$P(y < 3) = \int_{0}^{2}\int_{2}^{3} \tfrac{1}{8}(6 - x - y)\,dy\,dx = \tfrac{5}{8}$$

hence

$$P(x < 1 \mid y < 3) = \frac{3/8}{5/8} = \frac{3}{5}$$

The extension of these ideas to the case of more than two variates 
is apparent. In general, any function $f(x_1, x_2, \ldots, x_k)$ may be
regarded as a density function of k random variables, provided that

$$f(x_1, x_2, \ldots, x_k) \ge 0$$
$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_k)\,dx_1\,dx_2 \cdots dx_k = 1 \qquad (5)$$

The probability that a point $(x_1, x_2, \ldots, x_k)$ falls in any given region
of the k-dimensional space is obtained by integrating the density
function over that region.

The function

$$f(x_1, x_2, x_3, x_4) = \begin{cases} 16\,x_1 x_2 x_3 x_4 & 0 < x_i < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$


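is a density function since it satisfies the two requirements. The
probability that a point falls in the region $x_1 < 1/2$, $x_4 > 1/3$ is

$$P(x_1 < \tfrac{1}{2},\ x_4 > \tfrac{1}{3}) = \int_{1/3}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{1/2} f(x_1, x_2, x_3, x_4)\,dx_1\,dx_2\,dx_3\,dx_4$$

$$= \int_{1/3}^{1}\int_{0}^{1}\int_{0}^{1}\int_{0}^{1/2} 16\,x_1 x_2 x_3 x_4\,dx_1\,dx_2\,dx_3\,dx_4 = \tfrac{2}{9}$$
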
4.4. Cumulative Distributions. Since in the case of continuous
variates the probabilities are given by integrals, it is often convenient
to deal with the integrals of the densities rather than the densities
themselves. Let f(x) be a density function for one variate (such as is
plotted in Fig. 15, for example) and let

$$F(x) = \int_{-\infty}^{x} f(t)\,dt \qquad (1)$$

This function F(x) is the probability that the value of an observation
will be less than x. Thus

$$F(a) = P(x < a) \qquad (2)$$

F(x) is called the cumulative distribution function of x, or simply the
cumulative distribution. The graph of a cumulative distribution
function is illustrated in Fig. 19.

Fig. 19.

Any function F(x) may be regarded
as the cumulative distribution of a random variable, provided that

    F(x) is a nondecreasing function    (3)
    $F(-\infty) = 0$                    (4)
    $F(\infty) = 1$                     (5)


and given the cumulative distribution, one can find the density by
differentiating it:

$$f(x) = \frac{dF(x)}{dx} \qquad (6)$$

The probability that x falls in an interval a < x < b is, in terms of the
cumulative distribution,

$$P(a < x < b) = P(x < b) - P(x < a) = F(b) - F(a) \qquad (7)$$



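Referring to the example at the end of Sec. 2, where

$$f(x) = \begin{cases} \tfrac{1}{18}(3 + 2x) & 2 < x < 4 \\ 0 & \text{otherwise} \end{cases}$$

we find

$$F(x) = \begin{cases} 0 & x < 2 \\ \int_{2}^{x} \tfrac{1}{18}(3 + 2t)\,dt = \tfrac{1}{18}(x^{2} + 3x - 10) & 2 < x < 4 \\ 1 & x > 4 \end{cases}$$

and the probability is

$$P(2 < x < 3) = F(3) - F(2) = \tfrac{1}{18}(9 + 9 - 10) - 0 = \tfrac{4}{9}$$

The function is plotted in Fig. 20.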
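
The relations between a density and its cumulative distribution can be checked on this example. The sketch below codes F(x) = (x^2 + 3x - 10)/18 on 2 < x < 4, recovers an interval probability as a difference of two values of F, and compares a central difference quotient of F with the density (3 + 2x)/18. The step size `h` and the test points are arbitrary choices of ours.

```python
def F(x):
    """Cumulative distribution for the density (3 + 2x)/18 on 2 < x < 4."""
    if x <= 2:
        return 0.0
    if x >= 4:
        return 1.0
    return (x * x + 3 * x - 10) / 18

def f(x):
    """The corresponding density."""
    return (3 + 2 * x) / 18 if 2 < x < 4 else 0.0

p_2_3 = F(3) - F(2)                       # P(2 < x < 3), should be 4/9

h = 1e-5
points = [2.5, 3.0, 3.5]
slopes = [(F(x + h) - F(x - h)) / (2 * h) for x in points]   # approximates dF/dx

grid = [i / 100 for i in range(0, 601)]   # 0.00, 0.01, ..., 6.00
nondecreasing = all(F(a) <= F(b) for a, b in zip(grid, grid[1:]))
print(p_2_3, slopes, nondecreasing)
```

The slopes agree with the density at the interior points, as differentiating F requires.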
For several variates the cumulative distribution is defined similarly:

$$F(x_1, x_2, \ldots, x_k) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\cdots\int_{-\infty}^{x_k} f(t_1, t_2, \ldots, t_k)\,dt_k \cdots dt_2\,dt_1 \qquad (8)$$

where $f(x_1, x_2, \ldots, x_k)$ is the density. The value of the cumulative
distribution at the point $(a_1, a_2, \ldots, a_k)$, for example, is the
probability

$$P(x_1 < a_1,\ x_2 < a_2,\ \ldots,\ x_k < a_k) = F(a_1, a_2, \ldots, a_k) \qquad (9)$$

Any function $F(x_1, x_2, \ldots, x_k)$ may be regarded as a cumulative
distribution of k variates, provided that

    $F(x_1, x_2, \ldots, x_k)$ is nondecreasing in every variate    (10)
    $F(\infty, \infty, \ldots, \infty) = 1$    (11)
    $F(-\infty, x_2, \ldots, x_k) = F(x_1, -\infty, \ldots, x_k) = \cdots = F(x_1, x_2, \ldots, -\infty) = 0$    (12)


and this last condition is intended to indicate that F vanishes if any 
one of the variates approaches minus infinity. Given the cumulative 
distribution F, the density may be found by differentiating F with 
respect to each of its variates: 

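$$f(x_1, x_2, \ldots, x_k) = \frac{\partial}{\partial x_1}\frac{\partial}{\partial x_2}\cdots\frac{\partial}{\partial x_k} F(x_1, x_2, \ldots, x_k) \qquad (13)$$

Fig. 21.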


To illustrate a cumulative distribution for two variates, we may use 
the density given in equation (3.4): 


$$f(x, y) = \begin{cases} \tfrac{1}{8}(6 - x - y) & 0 < x < 2,\ 2 < y < 4 \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$


There are nine regions in the x, y plane to be taken account of in
defining F(x, y); the nine regions are indicated in Fig. 21, in which the
coordinates of the points of intersection of the lines are given. (The
left vertical line coincides with the y axis.) This complication arises
because of the piecewise definition of f(x, y). We could simply state
that

$$F(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(s, t)\,dt\,ds \qquad (15)$$


but a more detailed characterization of the function will be required 
if it is to be useful. In region 1 of Fig. 21 f(x, y) is zero; hence 


$$F(x, y) = 0 \qquad x < 0,\ y < 2$$

In region 2, although y is greater than two, we have x < 0, so that (15)
is still zero since f(s, t) never becomes positive over the range of
integration. The same is true in regions 3, 6, 7. For x, y in region 5, the
integrand is not zero when 0 < s < x, 2 < t < y, and we have


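$$\begin{aligned} F(x, y) &= \int_0^x \int_2^y \tfrac{1}{8}(6 - s - t)\, dt\, ds \\ &= \int_0^x \tfrac{1}{8}\left[(6 - s)(y - 2) - \tfrac{1}{2}(y^2 - 4)\right] ds \\ &= \tfrac{1}{16}\, x(y - 2)(10 - y - x) \qquad 0 < x < 2,\ 2 < y < 4 \end{aligned} \tag{16}$$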


For any point in region 4, the integrand in (15) is positive when
0 < s < x, 2 < t < 4; hence
$$F(x, y) = \int_0^x \int_2^4 f(s, t)\, dt\, ds$$
and this integral may be computed by putting y = 4 in (16) to get
$$F(x, y) = \tfrac{1}{8}\, x(6 - x) \qquad 0 < x < 2,\ y > 4$$


Similarly, in region 8, F(x, y) = F(2, y) when x > 2, so that
$$F(x, y) = \tfrac{1}{8}(y - 2)(8 - y) \qquad x > 2,\ 2 < y < 4$$


and in region 9, F(x, y) = 1. Combining these results, 


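$$F(x, y) = \begin{cases} 0 & x < 0 \text{ or } y < 2 \\ \tfrac{1}{16}\, x(y - 2)(10 - y - x) & 0 < x < 2,\ 2 < y < 4 \\ \tfrac{1}{8}\, x(6 - x) & 0 < x < 2,\ y > 4 \\ \tfrac{1}{8}(y - 2)(8 - y) & x > 2,\ 2 < y < 4 \\ 1 & x > 2,\ y > 4 \end{cases} \tag{17}$$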


The function is plotted in Fig. 22. 
The probability that a point (x, y) will fall in any rectangle, say
$a_1 < x < b_1$, $a_2 < y < b_2$, may be written in terms of the cumulative
distribution as follows:
$$\begin{aligned} P(a_1 < x < b_1, a_2 < y < b_2) &= P(x < b_1, y < b_2) - P(x < a_1, y < b_2) \\ &\quad - P(x < b_1, y < a_2) + P(x < a_1, y < a_2) \\ &= F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2) \end{aligned} \tag{18}$$
Thus, in the above example, 


$$\begin{aligned} P(0 < x < 1, 3 < y < 4) &= F(1, 4) - F(0, 4) - F(1, 3) + F(0, 3) \\ &= \tfrac{5}{8} - 0 - \tfrac{3}{8} + 0 \\ &= \tfrac{1}{4} \end{aligned}$$
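The piecewise form (17) and the rectangle formula (18) can be checked mechanically. The following is a minimal sketch in Python, assuming exact rational arithmetic from the standard `fractions` module; the names `F` and `rect_prob` are illustrative and do not appear in the text.

```python
from fractions import Fraction

def F(x, y):
    """Cumulative distribution (17) for f(x, y) = (6 - x - y)/8 on 0<x<2, 2<y<4."""
    x, y = Fraction(x), Fraction(y)
    if x <= 0 or y <= 2:
        return Fraction(0)
    # Clamping x to 2 and y to 4 reproduces the branches for x > 2 and y > 4.
    x, y = min(x, Fraction(2)), min(y, Fraction(4))
    return x * (y - 2) * (10 - y - x) / 16

def rect_prob(a1, b1, a2, b2):
    """P(a1 < x < b1, a2 < y < b2) computed from equation (18)."""
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

print(rect_prob(0, 1, 3, 4))  # 1/4
```

Clamping the arguments is a design shortcut: it works here because F is constant in x beyond 2 and in y beyond 4, which is exactly what the last three lines of (17) express.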


Fig. 22.


These distributions can become quite complex for several variables, 
and in fact many important problems in applied statistics remain 
unsolved merely because the integrations required for their solution 
are too complex to perform. Modern developments in high-speed 
computing machines promise to remedy this situation within the next 
few years. 

In this book we shall ordinarily use small letters to denote probabil- 
ity density functions and the corresponding capital letters to represent 
their cumulative forms. Thus, 


$$G(x) = \int_{-\infty}^{x} g(t)\, dt$$
or if the variate is discrete,
$$G(x) = \sum_{t \le x} g(t)$$


The word density will refer specifically to g(x), while the phrase
cumulative distribution will refer specifically to G(x). The word distribution
will be used as a more general term and may refer to either the density
or its cumulative form.

4.5. Marginal Distributions. Associated with any distribution of
more than one variable are several marginal distributions. Let f(x, y)
be a density for two continuous variates. We may be interested in
only one of the variates, say x. We therefore seek a function of x
which when integrated over an interval, say a < x < b, will give the
probability that x will lie in that interval. In the x, y plane such an
interval corresponds to a strip as illustrated in Fig. 23. The
specification a < x < b is satisfied by any point in the strip; hence
$$P(a < x < b) = \int_a^b \int_{-\infty}^{\infty} f(x, y)\, dy\, dx \tag{1}$$

Fig. 23.


Whatever the specification on x, the limits of integration for y are $-\infty$
to $+\infty$, so we may define a function, say
$$f_1(x) = \int_{-\infty}^{\infty} f(x, y)\, dy \tag{2}$$
and this function is the required marginal density, since
$$P(a < x < b) = \int_a^b f_1(x)\, dx \tag{3}$$
for any pair of values a and b. Similarly the marginal density of y is
$$f_2(y) = \int_{-\infty}^{\infty} f(x, y)\, dx \tag{4}$$
In general, given any density $f(x_1, x_2, \cdots, x_k)$, one may find the
marginal density of any subset of the variates by integrating the function
with respect to all the other variates between the limits $-\infty$ and
$+\infty$. Thus the marginal density of $x_1$, $x_2$, and $x_4$, for example, is
$$f_{124}(x_1, x_2, x_4) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \cdots, x_k)\, dx_3\, dx_5 \cdots dx_k \tag{5}$$


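Referring to the distribution defined in equation (3.4), the marginal
density of x is
$$\begin{aligned} f_1(x) &= \int_{-\infty}^{\infty} f(x, y)\, dy & -\infty < x < \infty \\ &= \int_2^4 \tfrac{1}{8}(6 - x - y)\, dy & 0 < x < 2 \\ &= \tfrac{1}{4}(3 - x) & 0 < x < 2 \\ &= 0 & x < 0 \text{ or } x > 2 \end{aligned} \tag{6}$$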


The cumulative marginal distribution is easily found if the cumulative
distribution is given. For two variables, the cumulative marginal
distribution of x is
$$F_1(x) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(t, y)\, dy\, dt = \int_{-\infty}^{x} f_1(t)\, dt = F(x, \infty) \tag{7}$$
Thus we need only let the variable in which we are not interested
become infinite in the joint cumulative distribution. And in general,
if $F(x_1, x_2, \cdots, x_k)$ is a k-variate cumulative distribution, the
cumulative marginal distribution of $x_1$, $x_2$, $x_4$, for example, is
$$F_{124}(x_1, x_2, x_4) = F(x_1, x_2, \infty, x_4, \infty, \cdots, \infty) \tag{8}$$
In our specific example we may find the cumulative marginal distribution
of x by integrating $f_1(x)$; thus
$$F_1(x) = \int_{-\infty}^{x} f_1(t)\, dt = \begin{cases} 0 & x < 0 \\ \tfrac{1}{8}\, x(6 - x) & 0 < x < 2 \\ 1 & x > 2 \end{cases} \tag{9}$$


The same result is obtained by letting y become infinite in F(x, y) 
given by equations (4.17). 
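The marginal results (6) and (9) can also be approximated numerically by integrating the joint density directly. The sketch below is only an illustration under stated assumptions: it uses a simple midpoint rule (the helper `midpoint` is hypothetical, not from the text) and checks one point inside the range 0 < x < 2.

```python
def f(x, y):
    """Joint density (3.4): (6 - x - y)/8 on 0 < x < 2, 2 < y < 4, else zero."""
    return (6 - x - y) / 8 if 0 < x < 2 and 2 < y < 4 else 0.0

def midpoint(g, lo, hi, n=400):
    """Midpoint-rule approximation to the integral of g over (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

def f1(x):
    """Marginal density of x: y is integrated out numerically, as in (6)."""
    return midpoint(lambda y: f(x, y), 2.0, 4.0)

def F1(x):
    """Cumulative marginal of x: f1 is integrated numerically, as in (9)."""
    return midpoint(f1, 0.0, x) if x > 0 else 0.0

x = 1.5
print(abs(f1(x) - (3 - x) / 4) < 1e-9)      # agreement with (6)
print(abs(F1(x) - x * (6 - x) / 8) < 1e-9)  # agreement with (9)
```

The midpoint rule is exact (up to rounding) for integrands that are linear over the whole interval, which is why such a tight tolerance is reasonable for 0 < x < 2; for x > 2 the jump of the density at 2 would make the approximation cruder.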

4.6. Conditional Distributions. We shall consider first a bivariate 
density, say f(x, y), which might be represented by the surface of 
Fig. 18, for example. Suppose a point (x, y) is drawn (a shot is fired
at a target, for example), and suppose the second variate y is observed
but not the first. We seek a function, say f(x|y), which will give the
density of x when y is known; i.e., a function such that
$$P(a < x < b \mid y) = \int_a^b f(x \mid y)\, dx \tag{1}$$


for any arbitrarily chosen a and b. 

If we change the above problem so that it concerns probabilities 
rather than distributions of continuous variates, we may use the 
definition of conditional probability given in Sec. 2.7. Thus we may 
compute (assuming c > 0) 


$$P(a < s < b \mid y - c < t < y + c) = \frac{\displaystyle\int_a^b \int_{y-c}^{y+c} f(s, t)\, dt\, ds}{\displaystyle\int_{-\infty}^{\infty} \int_{y-c}^{y+c} f(s, t)\, dt\, ds} \tag{2}$$

The denominator may be written in terms of the marginal density of
y, say $f_2(y)$, as
$$\int_{y-c}^{y+c} f_2(t)\, dt$$
and this is equal to $2c f_2(y')$, where $y'$ is some value in the interval
y - c to y + c. Similarly the numerator of (2) is equal to
$$2c \int_a^b f(s, y'')\, ds$$
where $y''$ is some point in the interval y - c to y + c. Hence the
probability is
$$P(a < s < b \mid y - c < t < y + c) = \frac{\displaystyle\int_a^b f(s, y'')\, ds}{f_2(y')} \tag{3}$$

Now we shall let c approach zero. Since $y'$, $y''$, and t are all in the
interval y - c to y + c and must remain in the interval however small
c becomes, it follows that they must all approach y. Hence the limit
of (3) as c becomes zero is
$$P(a < x < b \mid y) = \frac{\displaystyle\int_a^b f(s, y)\, ds}{f_2(y)} \tag{4}$$

Since this relation holds true for any a and b, it follows that
$$f(x \mid y) = \frac{f(x, y)}{f_2(y)} \tag{5}$$


By similar reasoning, if $f_1(x)$ is the marginal density of x, the
conditional density of y given x is
$$f(y \mid x) = \frac{f(x, y)}{f_1(x)} \tag{6}$$


The function f(x|y) is a function of one variate x; y is simply a
parameter and will have some numerical value in any specific conditional
density. Thus $f_2(y)$ is to be regarded as a constant. The joint
density f(x, y) plots as a surface over the x, y plane. A plane
perpendicular to the x, y plane which intersects the x, y plane on the line
y = c will intersect the surface in the curve f(x, c). The area under
this curve is
$$\int_{-\infty}^{\infty} f(x, c)\, dx = f_2(c)$$
hence if we divide f(x, c) by $f_2(c)$, we obtain a density function which is
precisely f(x|c).
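For the particular function
$$f(x, y) = \begin{cases} \tfrac{1}{8}(6 - x - y) & 0 < x < 2,\ 2 < y < 4 \\ 0 & \text{otherwise} \end{cases}$$
we have found in the preceding section that the marginal density of x
is
$$f_1(x) = \begin{cases} \tfrac{1}{4}(3 - x) & 0 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$
In view of (6) the conditional density of y for fixed x is therefore
$$f(y \mid x) = \frac{6 - x - y}{2(3 - x)} \qquad 2 < y < 4$$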
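As a check on this conditional density, one can confirm that dividing the joint density by the marginal gives a function of y that integrates to one for a fixed x. The sketch below assumes exact rational arithmetic; the helper `integrate_linear` is a hypothetical name and relies on the fact that the midpoint rule is exact for integrands that are linear on the interval.

```python
from fractions import Fraction

def f(x, y):
    """Joint density: (6 - x - y)/8 on 0 < x < 2, 2 < y < 4, else zero."""
    return (6 - x - y) / Fraction(8) if 0 < x < 2 and 2 < y < 4 else Fraction(0)

def f1(x):
    """Marginal density of x: (3 - x)/4 on 0 < x < 2, else zero."""
    return (3 - x) / Fraction(4) if 0 < x < 2 else Fraction(0)

def cond_y_given_x(y, x):
    """Conditional density f(y|x) = f(x, y) / f1(x), following equation (6)."""
    return f(x, y) / f1(x)

def integrate_linear(g, lo, hi, n=8):
    """Midpoint rule with rational steps; exact when g is linear on (lo, hi)."""
    h = Fraction(hi - lo, n)
    return h * sum(g(lo + (i + Fraction(1, 2)) * h) for i in range(n))

x = Fraction(1, 2)
print(cond_y_given_x(3, 1))                                    # 1/2
print(integrate_linear(lambda y: cond_y_given_x(y, x), 2, 4))  # 1
```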
Conditional distributions are defined analogously for multivariate
distributions. Thus for five variates with a density $f(x_1, x_2, x_3, x_4, x_5)$,
the conditional density of $x_1$, $x_2$, $x_4$, given specific values of $x_3$ and $x_5$, is
$$f(x_1, x_2, x_4 \mid x_3, x_5) = \frac{f(x_1, x_2, x_3, x_4, x_5)}{f_{35}(x_3, x_5)}$$
where $f_{35}(x_3, x_5)$ represents the marginal density of $x_3$ and $x_5$.

4.7. Independence. If the conditional density f(x|y) does not
involve y and if the range of the conditional density does not depend
on y, then x is independent of y in the probability sense. Suppose
that this is the case and that we represent f(x|y) by g(x). Since, from
Sec. 6,
$$f(x \mid y) = g(x) = \frac{f(x, y)}{f_2(y)} \tag{1}$$
it follows that
$$f(x, y) = g(x) f_2(y) \tag{2}$$


hence the joint density of x and y is the product of two functions, one 
involving x only and the other involving y only. If we integrate (2) 
with respect to y over the whole range of y, we find that g(x) is simply 
the marginal density of x. Thus we may state: 

If two variates x and y are independent in the probability sense, then
their joint distribution is equal to the product of their marginal
distributions.

The converse of this statement is also true. That is, if f(x, y) can be
factored into two functions, one involving x only and the other involving
y only, and if the ranges of x and y do not depend on each other,
then x and y are independent in the probability sense.

In general, if the conditional distribution of a subset of any set of 
variates is independent of the remaining fixed variables, then that sub- 
set is said to be independent of the remaining variables in the prob- 
ability sense. The function defined in equation (3.6) provides an 
illustration: 


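$$f(x_1, x_2, x_3, x_4) = \begin{cases} 16 x_1 x_2 x_3 x_4 & 0 < x_i < 1 \text{ for all } i \\ 0 & \text{otherwise} \end{cases}$$

The marginal density of, say, $x_2$ and $x_4$ is
$$f_{24}(x_2, x_4) = \int_0^1 \int_0^1 f(x_1, x_2, x_3, x_4)\, dx_1\, dx_3 = \begin{cases} 4 x_2 x_4 & 0 < x_2 < 1,\ 0 < x_4 < 1 \\ 0 & \text{otherwise} \end{cases}$$

Hence the conditional density of $x_1$ and $x_3$ is
$$f(x_1, x_3 \mid x_2, x_4) = \begin{cases} 4 x_1 x_3 & 0 < x_1 < 1,\ 0 < x_3 < 1 \\ 0 & \text{otherwise} \end{cases}$$

This function and its range do not involve $x_2$ and $x_4$, so that the pair of
variables $(x_1, x_3)$ is independent of the pair $(x_2, x_4)$ in the probability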
sense. In fact, all four variates of this distribution are mutually 
independent as may be deduced from the fact that the function may 
be factored into four functions each involving only one of the variates. 
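The independence claim for this example can be illustrated by computing the conditional density at a fixed $(x_1, x_3)$ for two different conditioning points and observing that the value does not change. This is a small sketch with exact fractions; the function names are illustrative only.

```python
from fractions import Fraction

def joint(x1, x2, x3, x4):
    """Density (3.6): 16*x1*x2*x3*x4 on the unit hypercube, zero elsewhere."""
    inside = all(0 < v < 1 for v in (x1, x2, x3, x4))
    return 16 * x1 * x2 * x3 * x4 if inside else Fraction(0)

def marginal_24(x2, x4):
    """Marginal density of (x2, x4) from the text: 4*x2*x4 on the unit square."""
    return 4 * x2 * x4 if 0 < x2 < 1 and 0 < x4 < 1 else Fraction(0)

def conditional_13(x1, x3, x2, x4):
    """Conditional density of (x1, x3) given (x2, x4): joint over marginal."""
    return joint(x1, x2, x3, x4) / marginal_24(x2, x4)

half, third, fifth = Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)
# Same (x1, x3) = (1/2, 1/3), two different conditioning points (x2, x4).
print(conditional_13(half, third, fifth, half))   # 2/3
print(conditional_13(half, third, third, fifth))  # 2/3
```

Both values equal $4 x_1 x_3 = 4 \cdot \tfrac12 \cdot \tfrac13$, as the factorization predicts.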


4.8. Problems 


1. If f(x) = 2x when 0 < x < 1 and zero otherwise, find the
probability that (a) x < 1/2; (b) 1/4 < x < 1/2; (c) x > 3/4 given x > 1/2.

2. Define a density function using the function x(2 - x) over the
range 0 < x < 2. Find the probability that a < x < b if
0 < a < b < 2; if a < 0 < 2 < b.

3. If $f(x) = 3x^2$ when 0 < x < 1 and zero otherwise, find the
number a such that x is equally likely to be greater than or less than a.
Find the number b such that the probability that x will exceed b is
equal to .05.

4. A variate x has the density f(x) = x/2 when 0 < x < 2 and
zero otherwise. If two values of x are drawn, what is the probability
that both will be greater than one? If three are drawn, what is the
probability that exactly two will be greater than one?

5. A variate x has the density f(x) = 1 when 0 < x < 1 and zero
otherwise. Determine the number a such that the probability will be
.9 that at least one of four values of x drawn at random will exceed a.

6. Suppose the life in hours of a certain kind of radio tube has the
density $f(x) = 100/x^2$ when x > 100 and zero when x < 100. What
is the probability that none of three such tubes in a given radio set
will have to be replaced during the first 150 hours of operation? What
is the probability that all three of the original tubes will have been
replaced during the first 150 hours?

7. A machine makes bolts with diameters distributed by the density
$f(x) = K(x - .24)^2(x - .26)^2$ when .24 < x < .26 and zero otherwise.
K is the number which makes $\int_{.24}^{.26} f(x)\, dx = 1$. Bolts must be
scrapped if their diameters deviate from .25 by more than .008. What
proportion of the bolts may be expected to be scrap?

8. A bombing plane carrying three bombs flies directly above a
railroad track. If a bomb falls within 40 feet of the track, the track
will be sufficiently damaged to disrupt traffic. With a certain
bombsight the density of points of impact of a bomb is
$$f(x) = \begin{cases} (100 + x)/10{,}000 & -100 < x < 0 \\ (100 - x)/10{,}000 & 0 < x < 100 \\ 0 & \text{elsewhere} \end{cases}$$
x represents the vertical deviation from the aiming point, which is the
track in this case. If all three bombs are used, what is the probability
that the track will be damaged?

9. Referring to the above problem, the plane can carry eight bombs
of a smaller size, but one of these must hit within 15 feet of the track
to damage it. Should the lighter or heavier bombs be used on this
mission?

10. A country filling station is supplied with gasoline once a week.
If its weekly volume x of sales in thousands of gallons is distributed by
$f(x) = 5(1 - x)^4$, 0 < x < 1, what must be the capacity of its tank in
order that the probability that its supply will be exhausted in a given
week shall be .01?

11. A batch of small-caliber ammunition is accepted as satisfactory
if none of a sample of five shots falls more than 2 feet from the center
of a target at a given range. If r, the distance from the target center
of a given impact point, actually has the density
$$f(r) = \frac{2re^{-r^2}}{1 - e^{-9}}$$
0 < r < 3, for a given batch, what is the probability that the batch
will be accepted?

12. If f(x, y) = 1 when 0 < z < 1,0 < y < 1, and zero otherwise, 
find the probability that (a) z < 1$, y < 1$; (b) z+ y < 1; (c) z + 
y > 1; (d) z >2y; (e)  » 1; (f) x? +y? < 14; (g) £ = y; (h) v > 14 
given y < 14; (i) z > y given y > 14. 

13. If $f(x, y) = e^{-(x+y)}$ when x > 0, y > 0, and zero otherwise, find
P(x > 1); P(a < x + y < b) if 0 < a < b; P(x < y | x < 2y).

14. Using the distribution of Prob. 13, find the number a such that 
P(z +y <a) = M. 

15. If three points (x, y) are drawn at random where x and y are
distributed by the function given in Prob. 13, what is the probability
that at least one of them will fall in the square 0 < x < 1, 0 < y < 1?

16. A machine makes shafts with diameters x, and a second machine
makes bushings with inside diameters y. Suppose the density of x
and y is f(x, y) = 2500, .49 < x < .51, .51 < y < .53, and zero otherwise.
A bushing fits a shaft satisfactorily if its diameter exceeds
that of the shaft by at least .004 but not more than .036. What is
the probability that a bushing and shaft chosen at random will fit?

17. Find and plot roughly the cumulative distribution for the
distribution given in Prob. 6. Use the cumulative distribution to find
P(150 < x < 250).

18. Find and plot roughly the cumulative distribution for the func- 
tion given in Prob. 13, and use it to find Pil<2<23<y <4): 

19. Find the marginal density of x for the distribution of Prob. 13:
(a) by integrating out y; (b) by using the result of Prob. 18 to get the
cumulative marginal distribution, then differentiating the result.

20. Find the conditional density of x given y for the distribution of
Prob. 13. What is P(0 < x < 1 | y = 2)?

21. If $f(x, y) = (n - 1)(n - 2)/(1 + x + y)^n$ when x > 0, y > 0,
and zero elsewhere, find F(x, y), $f_1(x)$, $F_1(x)$, f(y|x).

22. If f(x, y) = 24y(1 - x - y) over the triangle bounded by the
axes and the line x + y = 1, find f(x|y).

23. If f(x, y) = 3x, 0 < y < x, 0 < x < 1, find the conditional
density of x.

24. If $f(x|y) = 3x^2/y^3$, 0 < x < y, and $f_2(y) = 5y^4$, 0 < y < 1, find
P(x > 1/2).

25. If f(x, y, z) = 8xyz, 0 < x < 1, 0 < y < 1, 0 < z < 1, find
P(x < y < z).

26. If $f(x) = 1/(1 + x)^2$, x > 0, find the density of x given that
x > 1.

27. If f(x, y) = 1, 0 < x < 1, 0 < y < 1, find the conditional
density of x and y given that $y < x^n$, n > 0.

28. If f(x) = 1, 0 < x < 1, find the density of y = 3x + 1. (Find
first the cumulative distribution of y and then differentiate it.)

29. If $f(x) = 2xe^{-x^2}$, x > 0, find the density of $y = x^2$.

30. If f(x, y) = 1, 0 < x < 1, 0 < y < 1, find the density of
z = x + y.

31. If f(x, y) = eC, x > 0, y > 0, find the density of 


(x +y) 
Me eer rus 


32. If $f(x, y) = 4xye^{-(x^2+y^2)}$, x > 0, y > 0, find the density of
$z = \sqrt{x^2 + y^2}$.

33. If f(x, y) = 4zy, 0 < x < 1,0 < y <1, find the joint density of 
u = z? v = y4? 

34. If f(x, y) = 3x, 0 < y < x, 0 < x < 1, find the density of
z = x - y.

35. If f(z) = (1 + z)/2, —1 < x < 1, find the density of y = a. 

36. If f(x, y) = 1, 0 < x < 1, 0 < y < 1, find the density of z
defined by: z = x + y if x + y < 1 and z = x + y - 1 if x + y > 1.

37. If $f(x, y) = e^{-(x+y)}$, x > 0, y > 0, find the joint density of
u = x + y and v = x. What is the marginal density of v?

38. If $f(x, y, z) = e^{-(x+y+z)}$, x > 0, y > 0, z > 0, find the density of
their average u = (x + y + z)/3.

39. If f(x, y) = 4x(1 — y), 0 < x < 1,0 € y < 1, find the density 
of x given that y < 14. 



40. If x is distributed by f(x), x > 0, find the density of $y = ax^2 + b$,
a > 0.

41. If x is distributed by f(x), $-\infty < x < \infty$, and if y = y(x) is
any increasing function of x [i.e., $y(x_1) > y(x_0)$ when $x_1 > x_0$], find the
density of y.

42. If f(x, y) = g(x)g(y), x > 0, y > 0, find P(x > y).

43. If f(x, y, z) = g(x)g(y)g(z), x > 0, y > 0, z > 0, what is the
probability that the coordinates of a randomly drawn point (x, y, z)
will not satisfy either x > y > z or x < y < z?

44. In which of the distributions defined in Probs. 21, 22, 23, 24, 31, 
32, 33, and 34 are the variates independent in the probability sense? 


CHAPTER 5 
EXPECTED VALUES AND MOMENTS 


5.1. Expected Values. The expected value of a random variable 
or any function of a random variable is obtained by finding the aver- 
age value of the function over all possible values of the variable. To 
consider a specific example: If three coins are tossed, the distribution 
of the number of heads that appear is the binomial 


$$f(x) = \binom{3}{x} \left(\tfrac{1}{2}\right)^3 \qquad x = 0, 1, 2, 3 \tag{1}$$

For a specific value of x, say x = 2, we think of f(2) = 3/8 as the
relative frequency with which two heads will appear in a large number of
trials. Thus in 1000 trials we expect no heads to appear in about
1000 × 1/8 = 125 trials, one head to appear in 1000 × 3/8 = 375 trials,
two heads in 375 trials, and three heads in 125 trials. Now let us
find the average number of heads in the 1000 trials. The total number
of heads is expected to be

125 × 0 + 375 × 1 + 375 × 2 + 125 × 3 = 1500

in the 1000 trials; thus the average is expected to be 1.5 heads per
trial. This is the expected value, or mean value, of x. It is clear that
the same result would have been obtained had we merely multiplied
all possible values of x by their probabilities and added the results;
thus,

0 × 1/8 + 1 × 3/8 + 2 × 3/8 + 3 × 1/8 = 1.5

The expected value is a theoretical or ideal average. We do not
actually expect x to take on its expected value in a given trial; in fact that
would be impossible in the present example. However, we might
reasonably expect the average value of x in a great number of trials
to be somewhere near the expected value of x.
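The two routes to the value 1.5 described above (averaging expected counts over 1000 trials, and summing values times probabilities) can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

def f(x):
    """Binomial distribution (1): number of heads in three tosses of a fair coin."""
    return Fraction(comb(3, x), 2 ** 3)

# Expected counts of 0, 1, 2, 3 heads in 1000 trials, as in the text.
counts = [1000 * f(x) for x in range(4)]
total_heads = sum(x * c for x, c in zip(range(4), counts))

# The same average obtained directly as the sum of x * f(x).
mean = sum(x * f(x) for x in range(4))

print(counts == [125, 375, 375, 125], total_heads, mean)  # True 1500 3/2
```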

These considerations lead us to define in general the expected value
of a discrete variate as $\sum x f(x)$, where f(x) is the distribution of x and
the sum is taken over the whole range of x. The symbol E(x) is used
to denote the expected value of x. Thus in the illustrative example
$$E(x) = \sum_{x=0}^{3} x f(x) = 1.5$$
In general, we shall define the expected value of any function of x, say
h(x), as
$$E[h(x)] = \sum_x h(x) f(x) \tag{2}$$


where the sum is taken over the whole range of x. Thus if
$h(x) = x^2 + 1$ and f(x) is as defined in equation (1),
$$E(x^2 + 1) = \sum_{x=0}^{3} (x^2 + 1) f(x) = 1 \times \tfrac{1}{8} + 2 \times \tfrac{3}{8} + 5 \times \tfrac{3}{8} + 10 \times \tfrac{1}{8} = 4$$
Similarly for several discrete variates $x_1, x_2, \cdots, x_k$, with distribution
$f(x_1, x_2, \cdots, x_k)$, the expected value of any function h of the variates
is defined to be
$$E[h(x_1, x_2, \cdots, x_k)] = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_k} h(x_1, x_2, \cdots, x_k) f(x_1, x_2, \cdots, x_k) \tag{3}$$
where the sums are taken over the entire range of each variate.
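Definitions (2) and (3) translate directly into a single summation over outcomes. The sketch below assumes a distribution is stored as a dictionary from outcomes to probabilities; that representation, the helper `expected`, and the two-coin example are assumptions made for illustration and are not part of the text.

```python
from fractions import Fraction

def expected(h, dist):
    """E[h] for a discrete distribution given as {outcome: probability}."""
    return sum(h(outcome) * p for outcome, p in dist.items())

# Distribution (1): number of heads in three tosses of a fair coin.
three_coins = {0: Fraction(1, 8), 1: Fraction(3, 8),
               2: Fraction(3, 8), 3: Fraction(1, 8)}
print(expected(lambda x: x ** 2 + 1, three_coins))  # 4

# Hypothetical two-variate case: (x1, x2) are indicators of heads on two
# independent fair coins, each of the four pairs having probability 1/4.
two_coins = {(a, b): Fraction(1, 4) for a in (0, 1) for b in (0, 1)}
print(expected(lambda pair: pair[0] + pair[1], two_coins))  # 1
```

For several variates the outcome is simply a tuple, so the nested sums in (3) collapse into one pass over the dictionary.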

For continuous variates we define expected values in terms of
integrals rather than sums. If x has the distribution f(x) and h(x) is any
function of x, then
$$E[h(x)] = \int_{-\infty}^{\infty} h(x) f(x)\, dx \tag{4}$$
This definition is suggested by the definition for discrete variates given
in equation (2) together with the definition of a definite integral as the
limit of a sum. Let the x axis be divided into intervals of length
$\Delta x_i$ ($i = 0, \pm 1, \pm 2, \cdots$) and let $x_i'$ be a point in the interval $\Delta x_i$
such that $f(x_i')\Delta x_i$ equals the area under f(x) over $\Delta x_i$. Then an
expected value of h(x) may be computed by regarding x as a discrete
variate which can take on only the values $x_i'$ with the probabilities
$f(x_i')\Delta x_i$. This expected value is
$$\sum_{i=-\infty}^{\infty} h(x_i') f(x_i') \Delta x_i$$
according to equation (2). The limit of this sum as all $\Delta x_i$ approach
zero will essentially remove the restriction that x be discrete, and the
limit is the integral given in (4). Similarly for several continuous
variates, we define
$$E[h(x_1, x_2, \cdots, x_k)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x_1, x_2, \cdots, x_k) f(x_1, x_2, \cdots, x_k)\, dx_1\, dx_2 \cdots dx_k \tag{5}$$


We shall avoid confusing the expected-value notation with the
functional notation by never using the letter E to represent a function.
E(g) will always represent the expected value of g, never a function
E of g. In the remainder of this chapter we shall not distinguish
between discrete and continuous variates. Expected values will
always be given in terms of integrals, but it is to be understood that the
integrals are to be replaced by sums in specific problems which deal
with discrete variates.

Two simple properties of E are worth noting. If x is distributed by
f(x), if c is any constant, and if g(x) and h(x) are any functions of x,
then
$$E[cg(x)] = cE[g(x)] \tag{6}$$
$$E[g(x) + h(x)] = E[g(x)] + E[h(x)] \tag{7}$$
These two relations follow directly from the corresponding relations
for integrals:
$$\int cg(x) f(x)\, dx = c \int g(x) f(x)\, dx$$
$$\int [g(x) + h(x)] f(x)\, dx = \int g(x) f(x)\, dx + \int h(x) f(x)\, dx$$
Of course (6) and (7) remain true if the single variate x is replaced by a
set of variates $x_1, x_2, \cdots, x_k$.

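5.2. Moments. The moments of a distribution are the expected
values of the powers of the random variable which has the given
distribution. The rth moment of x is usually denoted by $\mu_r'$ and is
$$\mu_r' = E(x^r) = \int_{-\infty}^{\infty} x^r f(x)\, dx \tag{1}$$
The first moment $\mu_1'$ is called the mean of x. The moments about any
arbitrary point a are defined as
$$E[(x - a)^r] = \int_{-\infty}^{\infty} (x - a)^r f(x)\, dx \tag{2}$$
and when a is put equal to the mean, we have the moments about the
mean, which are usually denoted by $\mu_r$:
$$\mu_r = E[(x - \mu_1')^r] = \int_{-\infty}^{\infty} (x - \mu_1')^r f(x)\, dx \tag{3}$$
We have
$$\mu_1 = \int_{-\infty}^{\infty} x f(x)\, dx - \mu_1' \int_{-\infty}^{\infty} f(x)\, dx = \mu_1' - \mu_1' = 0 \tag{4}$$
and
$$\begin{aligned} \mu_2 &= \int_{-\infty}^{\infty} (x - \mu_1')^2 f(x)\, dx \\ &= \int_{-\infty}^{\infty} [x^2 - 2x\mu_1' + (\mu_1')^2] f(x)\, dx \\ &= \mu_2' - 2\mu_1'\mu_1' + (\mu_1')^2 \\ &= \mu_2' - (\mu_1')^2 \end{aligned} \tag{5}$$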
This second moment about the mean is called the variance of x. 
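Relations (4) and (5) can be verified on the three-coin distribution used in Sec. 5.1. The sketch below is illustrative only; the helper names `raw_moment` and `central_moment` are assumptions, and sums replace the integrals because the variate is discrete.

```python
from fractions import Fraction

# Distribution of the number of heads in three tosses of a fair coin.
f = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def raw_moment(r):
    """rth moment about the origin: the sum of x**r * f(x)."""
    return sum(x ** r * p for x, p in f.items())

def central_moment(r):
    """rth moment about the mean: the sum of (x - mean)**r * f(x)."""
    mean = raw_moment(1)
    return sum((x - mean) ** r * p for x, p in f.items())

print(central_moment(1))                   # 0, as in (4)
print(central_moment(2))                   # 3/4
print(raw_moment(2) - raw_moment(1) ** 2)  # 3/4, as in (5)
```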

The mean value of a variate locates the center of its distribution in
the following sense: If the x axis is thought of as a bar with variable
density, the density at any point being given by f(x), then it is shown
in elementary calculus that the value $x = \mu_1'$ is the center of gravity
of the bar. Thus the mean may be thought of as a central value of the
variate. For this reason it is often referred to as a location parameter:
it tells one where the center of the distribution (in the
center-of-gravity sense) lies on the x axis. Other central values are sometimes
used to indicate the location of a distribution. One is the median,
which is defined as the point at which a vertical line bisects the area
under the curve f(x). The median is therefore the point $\mu''$, say, such
that
$$\int_{-\infty}^{\mu''} f(x)\, dx = \tfrac{1}{2} = \int_{\mu''}^{\infty} f(x)\, dx \tag{6}$$
Another central value for densities with one maximum is the mode,
which is the point at which f(x) attains its maximum. One could
easily devise other central values; these are the ones commonly used,
and of the three the mean is by far the most useful. We shall often
employ the symbol $\mu$ without the prime or subscript to denote the
mean.

The variance $\mu_2$ of a distribution is a measure of its spread, or
dispersion. If most of the area under the curve lies near the mean, the
variance will be small; while if the area is spread out over a considerable
range, the variance will be large. Distributions with different
variances are plotted in Fig. 30 in the following chapter. The variance
is necessarily nonnegative, since it is the integral or sum of nonnegative
quantities. It will vanish only when the distribution is concentrated at
one point, i.e., when the distribution is discrete and there is only one
possible outcome. The symbol $\sigma^2$ is commonly used to denote the
variance; the positive square root of the variance, $\sigma$, is called the
standard deviation.

We shall look a little further into the manner in which the variance
characterizes the distribution. Suppose $f_1(x)$ and $f_2(x)$ are two densities
with the same mean such that
$$\int_{\mu - a}^{\mu + a} [f_1(x) - f_2(x)]\, dx \ge 0 \tag{7}$$
for every value of a. Two such densities are illustrated in Fig. 24.
It can be shown that in this case the variance $\sigma_1^2$ of the first density is
smaller than the variance $\sigma_2^2$ of the second density. We shall not take
the time to prove this in detail, but the argument is roughly this: Let
$$g(x) = f_1(x) - f_2(x)$$
where $f_1(x)$ and $f_2(x)$ satisfy (7). Since $\int_{-\infty}^{\infty} g(x)\, dx = 0$, the positive
area between g(x) and the x axis is equal to the negative area.
Furthermore, in view of (7), every positive element of area $g(x')\,dx'$ may
be balanced by a negative element $g(x'')\,dx''$ in such a way that $x''$
is farther from $\mu$ than $x'$. When these elements of area are multiplied
by $(x - \mu)^2$, the negative elements will be multiplied by larger factors
than their corresponding positive elements; hence
$$\int_{-\infty}^{\infty} (x - \mu)^2 g(x)\, dx < 0$$
unless $f_1(x)$ and $f_2(x)$ are equal. Thus it follows that $\sigma_1^2 < \sigma_2^2$.

Fig. 24.

Fig. 25.

The converse of these statements is not true. That is, if one is told
that $\sigma_1^2 < \sigma_2^2$, he cannot conclude that the corresponding densities
satisfy (7) for all values of a, though it can be shown that (7) must be
true for certain values of a. Thus the condition $\sigma_1^2 < \sigma_2^2$ does not give
one any precise information about the nature of the corresponding
distributions, but it is evidence that $f_1(x)$ has more area near the mean
than $f_2(x)$, at least for certain intervals about the mean. The two
densities in Fig. 26, for example, might have about equal variances,
and one could alter either one slightly so as to make it have a smaller
or larger variance than the other.

Fig. 26.

The third moment $\mu_3$ about the mean is sometimes called a measure
of asymmetry, or skewness. Symmetric distributions like those in
Figs. 26 and 30 can be shown to have $\mu_3 = 0$. A curve shaped like
$f_1(x)$ in Fig. 27 is said to be skewed to the left and can be shown to
have a negative third moment about the mean; one shaped like $f_2(x)$
is called skewed to the right and can be shown to have a positive third
moment about the mean. Actually, however, knowledge of the third
moment gives almost no clue as to the shape of the distribution, and
we mention it at all mainly to point out the fact. Thus, for example,
the density $f_3(x)$ in Fig. 27 has $\mu_3 = 0$, but it is far from symmetric.
By changing the curve slightly we could give it either a positive or
negative third moment as we pleased.

While a particular moment or a few of the moments give little
information about a distribution, the whole set of moments ($\mu_1'$, $\mu_2'$,
$\mu_3'$, $\cdots$) will ordinarily determine the distribution exactly, and for
this reason we shall have occasion to use the moments in theoretical
work.

In applied statistics, the first two moments are of great importance,
as we shall see, but the third and higher moments are rarely useful.
Ordinarily one does not know what distribution function he is working
with in a practical problem, and often it makes little difference what
the actual shape of the distribution is. But it is usually necessary
to know at least the location of the distribution and to have some idea
of its dispersion. These characteristics can be estimated by examining
a sample drawn from a set of objects known to have the distribution in
question. This estimation problem is probably the most important
problem in applied statistics, and a large part of this course will be
devoted to a study of it.

Fig. 27.
Illustrative example: Find the mean and variance of the
hypergeometrical distribution
$$f(x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}} \qquad x = 0, 1, 2, \cdots, k \tag{8}$$
This problem will illustrate a technique that may be used to find the
moments of a great many discrete distributions. The first step is to
use the distribution to determine an identity in the parameters. Since
$\sum f(x) = 1$, it follows that
$$\sum_{x=0}^{k} \binom{m}{x}\binom{n}{k-x} = \binom{m+n}{k} \tag{9}$$
for any positive integral values of m, n, and k. [Actually, as we have
seen before, the range depends on the relative sizes of m, n, and k, but
we can avoid dealing with these details by defining the binomial
coefficient
$$\binom{a}{b} = \frac{a!}{b!(a-b)!}$$
to be zero when either b or a - b is negative.]
The mean of the distribution is
$$\mu = E(x) = \sum_{x=0}^{k} x\, \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}} \tag{10}$$
In this expression x may be canceled with the x in the denominator of
$\binom{m}{x}$ to get
$$x\binom{m}{x} = \frac{m!}{(x-1)!(m-x)!} = m\binom{m-1}{x-1}$$
and we have
$$\mu = \sum_{x=1}^{k} \frac{m\binom{m-1}{x-1}\binom{n}{k-x}}{\binom{m+n}{k}} \tag{11}$$
where we have written the sum to range from 1 to k because the first
term in (10) vanishes and may be omitted. Actually, since we have
defined a binomial coefficient to be zero when its lower index is
negative, there would be no objection to leaving the limits 0 to k. Now in
this last expression let us substitute y for x - 1 and factor out factors
which do not involve the summation index. We get
$$\mu = \frac{m}{\binom{m+n}{k}} \sum_{y=0}^{k-1} \binom{m-1}{y}\binom{n}{k-1-y} \tag{12}$$
This sum may be evaluated by means of the identity (9); we simply
replace m by m - 1 and k by k - 1 in the right-hand side of (9) to
get
$$\mu = \frac{m\binom{m+n-1}{k-1}}{\binom{m+n}{k}} = \frac{mk}{m+n} \tag{13}$$


To get the variance, we shall need the second moment
$$\mu_2' = \sum_{x=0}^{k} x^2 f(x)$$
If we substitute directly for f(x), we shall be able to cancel only one
of the x's, and the other x will remain to prevent our using the identity
to evaluate the sum. The trick here is to write $x^2$ in the form
x(x - 1) + x to get
$$\mu_2' = \sum x(x - 1) f(x) + \sum x f(x) \tag{14}$$
We have already evaluated the second sum in obtaining the mean, and
the same procedure used on the first sum gives
$$\begin{aligned} E[x(x-1)] &= \sum_{x=0}^{k} x(x-1)\, \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}} \\ &= \frac{m(m-1)}{\binom{m+n}{k}} \sum_{y=0}^{k-2} \binom{m-2}{y}\binom{n}{k-2-y} \\ &= \frac{m(m-1)\binom{m+n-2}{k-2}}{\binom{m+n}{k}} \\ &= \frac{m(m-1)k(k-1)}{(m+n)(m+n-1)} \end{aligned} \tag{15}$$
Adding (13) to this, we get $\mu_2'$ in accordance with (14); the variance
is then obtained by subtracting the square of (13) from $\mu_2'$ in
accordance with (5). Thus the variance is
$$\sigma^2 = \frac{m(m-1)k(k-1)}{(m+n)(m+n-1)} + \frac{mk}{m+n} - \left(\frac{mk}{m+n}\right)^2 = \frac{mnk(m+n-k)}{(m+n)^2(m+n-1)} \tag{16}$$
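The closed forms (13) and (16) can be compared against a direct summation over the hypergeometric probabilities. The following sketch uses exact fractions and `math.comb`; the parameter values m = 5, n = 7, k = 4 are arbitrary choices made for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeometric(m, n, k):
    """Distribution (8) as a dictionary {x: f(x)} for x = 0, 1, ..., k."""
    total = comb(m + n, k)
    # comb(a, b) returns 0 when b > a, matching the text's convention.
    return {x: Fraction(comb(m, x) * comb(n, k - x), total) for x in range(k + 1)}

def mean_and_variance(dist):
    """Mean and variance by direct summation, using relation (5)."""
    mean = sum(x * p for x, p in dist.items())
    second = sum(x * x * p for x, p in dist.items())
    return mean, second - mean ** 2

m, n, k = 5, 7, 4  # hypothetical parameter values
dist = hypergeometric(m, n, k)
mean, var = mean_and_variance(dist)

print(sum(dist.values()) == 1)                                        # identity (9)
print(mean == Fraction(m * k, m + n))                                 # equation (13)
print(var == Fraction(m * n * k * (m + n - k),
                      (m + n) ** 2 * (m + n - 1)))                    # equation (16)
```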


The general method for higher moments is now evident. To get
the third moment, we would find the expected value of
x(x - 1)(x - 2); since this is equal to $x^3 - 3x^2 + 2x$, we have
$$\mu_3' - 3\mu_2' + 2\mu_1' = E[x(x - 1)(x - 2)]$$
and having evaluated the right-hand side of this expression, we could
solve for $\mu_3'$, since $\mu_1'$ and $\mu_2'$ have already been determined. Having
the third moment, we could obtain the fourth by finding the expected
value of x(x - 1)(x - 2)(x - 3), then solving for $\mu_4'$ in
$$\mu_4' - 6\mu_3' + 11\mu_2' - 6\mu_1' = E[x(x - 1)(x - 2)(x - 3)]$$
The right-hand side of this last expression is called the fourth factorial
moment of the distribution. The rth factorial moment is
$$E[x(x - 1)(x - 2) \cdots (x - r + 1)]$$

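Illustrative example: Find the mean and standard deviation of
the continuous distribution f(x) = 2(1 - x), 0 < x < 1. The rth
moment is
$$\begin{aligned} \mu_r' = E(x^r) &= \int_0^1 x^r\, 2(1 - x)\, dx \\ &= 2\int_0^1 (x^r - x^{r+1})\, dx \\ &= \frac{2}{(r+1)(r+2)} \end{aligned}$$
The mean is
$$\mu = \mu_1' = \frac{2}{2 \cdot 3} = \frac{1}{3}$$
and the variance is
$$\sigma^2 = \mu_2' - \mu^2 = \frac{2}{3 \cdot 4} - \frac{1}{9} = \frac{1}{18}$$
hence
$$\sigma = \frac{1}{3\sqrt{2}}$$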

5.3. Moment Generating Functions. When all the moments of a
distribution exist (i.e., when all moments are finite), it is possible to
associate a moment generating function with the distribution. This
is defined as $E(e^{tx})$, where x is the random variable and t is a
continuous variable; the expected value of $e^{tx}$ will be a function of t which we
shall denote by
$$m(t) = E(e^{tx}) = \int_{-\infty}^{\infty} e^{tx} f(x)\, dx \tag{1}$$
If we differentiate the members of this relation r times with respect to
t, we have
$$\frac{d^r}{dt^r} m(t) = \int_{-\infty}^{\infty} x^r e^{tx} f(x)\, dx \tag{2}$$
and on putting t = 0, we find
$$\frac{d^r}{dt^r} m(0) = E(x^r) = \mu_r' \tag{3}$$
where the symbol on the left is to be interpreted to mean the rth
derivative of m(t) evaluated at t = 0. Thus the moments of a
distribution may be obtained from the moment generating function by
differentiation.

If in equation (1) we replace $e^{tx}$ by its series expansion, we obtain
the series expansion of m(t) in terms of the moments of f(x); thus
$$\begin{aligned} m(t) &= E\left(1 + xt + \frac{1}{2!}(xt)^2 + \frac{1}{3!}(xt)^3 + \cdots\right) \\ &= 1 + \mu_1' t + \frac{1}{2!}\mu_2' t^2 + \cdots \\ &= \sum_{i=0}^{\infty} \frac{1}{i!}\mu_i' t^i \end{aligned} \tag{4}$$
from which it is again evident that $\mu_r'$ may be obtained by
differentiating m(t) r times and then putting t = 0.


We may illustrate this technique for finding moments by obtaining
the mean and variance of the Poisson density:

$$f(x) = \frac{e^{-a} a^x}{x!} \qquad x = 0, 1, 2, \cdots$$

We find

$$m(t) = E(e^{tx}) = \sum_{x=0}^{\infty} \frac{e^{tx} e^{-a} a^x}{x!} = e^{-a} \sum_{x=0}^{\infty} \frac{(ae^t)^x}{x!} = e^{-a} e^{ae^t}$$

The first two derivatives are

$$m'(t) = e^{-a} a e^t e^{ae^t}$$
$$m''(t) = e^{-a} a e^t e^{ae^t}(1 + ae^t)$$

whence

$$\mu_1' = m'(0) = a$$
$$\mu_2' = m''(0) = a(1 + a)$$
$$\sigma^2 = a(1 + a) - a^2 = a$$
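The Poisson result can be checked numerically. The sketch below is an illustration under stated assumptions: the sum defining m(t) is truncated at a fixed number of terms, the derivatives at zero are estimated by central differences rather than taken analytically, and the helper names and the value a = 2.5 are chosen here only for the example.

```python
import math


def poisson_pmf(x, a):
    """Poisson probability e^(-a) a^x / x! for x = 0, 1, 2, ..."""
    return math.exp(-a) * a ** x / math.factorial(x)


def poisson_mgf(t, a, terms=100):
    """m(t) = E(e^(tx)), approximated by the first `terms` terms of the sum."""
    return sum(math.exp(t * x) * poisson_pmf(x, a) for x in range(terms))


def derivatives_at_zero(m, h=1e-3):
    """Central-difference estimates of m'(0) and m''(0)."""
    first = (m(h) - m(-h)) / (2 * h)
    second = (m(h) - 2 * m(0.0) + m(-h)) / h ** 2
    return first, second


a = 2.5  # illustrative parameter value
m1, m2 = derivatives_at_zero(lambda t: poisson_mgf(t, a))
variance = m2 - m1 ** 2
print(m1, m2, variance)
```

Both the estimated mean and the estimated variance should come out close to a, in agreement with the closed forms above.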




The factorial moment generating function is defined as $E(t^x)$, and the
factorial moments are obtained from this function in the same way
as the ordinary moments are obtained from $E(e^{tx})$ except that t is put
equal to one instead of zero. This function sometimes simplifies the
problem of finding moments of discrete distributions. It is, however,
of no help in the example used in the preceding section, because the
sum $\sum t^x f(x)$ has no simple expression. For the Poisson distribution:

$$E(t^x) = e^{a(t - 1)}$$

whence

$$E(x) = a e^{a(t - 1)} \Big|_{t=1} = a$$

$$E[x(x - 1)] = a^2 e^{a(t - 1)} \Big|_{t=1} = a^2$$

giving the same moments as before.
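As a hedged check of the factorial moments, the sketch below sums $x(x-1)\cdots(x-r+1)f(x)$ directly for a Poisson variate. Differentiating $e^{a(t-1)}$ r times and putting t = 1 gives $a^r$, and the direct sum should agree. The function name, the truncation point, and the value a = 3 are assumptions made for illustration.

```python
import math


def factorial_moment(r, a, terms=100):
    """E[x(x-1)...(x-r+1)] for a Poisson variate with parameter a, by direct summation."""
    total = 0.0
    for x in range(r, terms):
        falling = math.perm(x, r)  # x(x-1)...(x-r+1)
        total += falling * math.exp(-a) * a ** x / math.factorial(x)
    return total


a = 3.0  # illustrative parameter value
first = factorial_moment(1, a)
second = factorial_moment(2, a)
# The second ordinary moment follows from x^2 = x(x - 1) + x.
second_raw = second + first
print(first, second, second_raw)
```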
Sometimes we shall have occasion to speak of the moments of a
function of a random variable. Thus we may want the moments of
h(x), where x has the distribution f(x). The rth moment of h(x) is

$$E([h(x)]^r) = \int_{-\infty}^{\infty} [h(x)]^r f(x)\,dx \tag{5}$$

and a function which will generate the moments is obviously

$$E(e^{th(x)}) = \int_{-\infty}^{\infty} e^{th(x)} f(x)\,dx \tag{6}$$


5.4. Moments for Multivariate Distributions. The preceding ideas
are readily extended to distributions of several variates. Suppose,
for example, that we have three variates (x, y, z) with density f(x, y, z).
The rth moment of y, for example, is

$$E(y^r) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y^r f(x, y, z)\,dz\,dy\,dx \tag{1}$$

Besides the moments of the individual variates, there are various joint
moments defined in general by

$$E(x^q y^r z^s) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^q y^r z^s f(x, y, z)\,dz\,dy\,dx \tag{2}$$

where q, r, and s are any positive integers or zero. The most
important joint moment is the covariance, which is the joint moment
about the means of the product of two variates. Thus the covariance
between x and z is

$$\sigma_{xz} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [x - E(x)][z - E(z)] f(x, y, z)\,dz\,dy\,dx \tag{3}$$

and there are two other covariances $\sigma_{xy}$ and $\sigma_{yz}$ defined analogously.
The correlation between two variates, say x and z, is denoted by $\rho_{xz}$
and is defined by

$$\rho_{xz} = \frac{\sigma_{xz}}{\sigma_x \sigma_z} \tag{4}$$

where $\sigma_x$ and $\sigma_z$ are the standard deviations of x and z.
Also one can define a joint moment generating function:

$$m(t_1, t_2, t_3) = E(e^{t_1 x + t_2 y + t_3 z}) \tag{5}$$

It is clear that the rth moment of y, for example, may be obtained by
differentiating the moment generating function r times with respect
to $t_2$ and then putting all the t's equal to zero. Similarly the joint
moment (2) would be obtained by differentiating the function q times
with respect to $t_1$, r times with respect to $t_2$, s times with respect to
$t_3$, and then putting all the t's equal to zero.
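The definitions of joint moment, covariance, and correlation can be illustrated on a small discrete distribution, where the integrals become sums. The sketch below is an assumption-laden example and not from the text: the joint probabilities are hypothetical, only two variates are used, and the helper names are introduced here.

```python
import math

# Hypothetical joint probabilities P(x, y) on a small grid; they sum to one.
joint = {
    (0, 0): 0.2, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.2,
    (2, 0): 0.1, (2, 1): 0.3,
}


def expect(g):
    """E[g(x, y)] as a probability-weighted sum over the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


def joint_moment(q, r):
    """E(x^q y^r), the discrete counterpart of the joint moment (2)."""
    return expect(lambda x, y: x ** q * y ** r)


mean_x, mean_y = joint_moment(1, 0), joint_moment(0, 1)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))
sd_x = math.sqrt(expect(lambda x, y: (x - mean_x) ** 2))
sd_y = math.sqrt(expect(lambda x, y: (y - mean_y) ** 2))
rho_xy = cov_xy / (sd_x * sd_y)
print(cov_xy, rho_xy)
```

The covariance computed about the means agrees with $E(xy) - E(x)E(y)$, which is often the more convenient form for hand computation.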

5.5. The Moment Problem. We have seen that a distribution f(x)
determines a set of moments $(\mu_1', \mu_2', \mu_3', \cdots)$. One of the important
problems of theoretical statistics is to find f(x) when the moments are
given. A study of this problem requires advanced mathematical
techniques, and we shall have to omit it. However we shall prove the
following theorem which will be required in our later work:

If two continuous densities have the same set of moments and if the
difference of the densities has a series expansion about the origin, then the
two densities are equivalent.

Suppose the two densities are represented by f(x) and g(x) and suppose
the series expansion of their difference is

$$f(x) - g(x) = c_0 + c_1 x + c_2 x^2 + \cdots$$

Now let us consider the integral

$$\int_{-\infty}^{\infty} [f(x) - g(x)]^2\,dx = \int_{-\infty}^{\infty} (c_0 + c_1 x + c_2 x^2 + \cdots)[f(x) - g(x)]\,dx$$
$$= c_0(1 - 1) + c_1(\mu_1' - \mu_1') + c_2(\mu_2' - \mu_2') + \cdots$$
$$= 0$$

since the two densities are assumed to have the same moments. The
function $[f(x) - g(x)]^2$ is necessarily positive or zero, and as we have
found the area under the function to be zero, we must conclude that
the function is zero and hence that

$$f(x) = g(x)$$

Under the conditions of this theorem it follows that

If two random variables have the same moment generating function, then
they have the same density function.

For if the variables have the same moment generating function, they
necessarily have the same moments.


5.6. Problems 


1. If 5000 lottery tickets are sold at $1 each on a $2000 car, what 
is the expected gain of a person who buys three tickets? 

2. A coin is tossed until a head appears; what is the expected num- 
ber of tosses? 

3. A bowl contains n chips numbered from 1 to n; m are drawn 
without replacement; what is the expected value of the sum of the 
numbers drawn? 

4. An event occurs with probability p and fails to occur with
probability q = 1 - p. In a single trial, what are the mean and variance
of x, the number of successes?

5. If n trials are made of the event described in Prob. 4, and if x
is the total number of successes, what are the mean and variance of
x?

6. Find the mean of the continuous variate x distributed by 


fz) = TCUCSA —o «g-« o 


7. Find the mean and variance of z if f(z) — TO edu
8. Find the mean and variance of 2z? if f(z) = 1,0 < z < 1.
9. Find the mean and variance of x if


f@)=1/@+? 0<2t<o 


10. Show that E(xy) = E(x)E(y) when x and y are independently
distributed.
11. Show that 


oo 


w= y () eno 


i 


12. What is the median of x ESO — (1.— 2),0<2<1? 

13. Find the moment generating function associated with the 
density f(x) = ae-, x > 0, and use it to obtain the mean and variance 
of x. 

14. Find the factorial moment generating function for the binomial
distribution, and use it to obtain the third moment $\mu_3'$.

15. If x has the density f(x) = x/2, 0 < x < 2, find the rth moment
of $x^2$. Then show that $y = x^2$ has the distribution

$$g(y) = \tfrac{1}{4} \qquad 0 < y < 4$$

by showing that y has the same moments as $x^2$.

16. If f(a, y) = ae), y > 0, y > 0, find the generating function. 
for the moments of u = z +y. Deduce the distribution of u from 
the form of this generating function. 

17. Show that if a density function f(x) is symmetric about a point, 
say b, [i.e., f(b + c) = f(b — c) for every value of c], then that point 
must be the mean of z. Show also in this case that all odd moments 
about the mean must be zero. 

18. Given the moment generating function m(t) for the moments $\mu_r'$
about the origin, how would one obtain the moment generating
function for the moments $\mu_r$ about the mean?

19. In place of the moments $\mu_r'$, another infinite set of constants $\gamma_r$
called the cumulants of a distribution is often useful for characterizing
the distribution function. The cumulants are defined by the generating
function c(t) = log m(t), where m(t) is the generating function
for the $\mu_r'$, i.e., $\gamma_r = \dfrac{d^r c(t)}{dt^r}$ evaluated at t = 0. Show that $\gamma_1 = \mu_1'$
and $\gamma_2 = \sigma^2$.
20. Find the rth cumulant y, for the density f(x) = ae, x > 0. 
21. Show that if M(t) generates the moments about an arbitrary
point b, i.e.,

$$M(t) = \int_{-\infty}^{\infty} e^{t(x - b)} f(x)\,dx$$

then C(t) = log M(t) will correctly generate all the cumulants except
the first. The cumulants of a distribution beyond $\gamma_1$ are thus said
to be invariant under translations of the variate.
22. If x has cumulants $\gamma_r$, show that y = kx has cumulants $k^r \gamma_r$.
23. Show that the correlation between two variates is zero if they
are independently distributed. (The converse of this statement is
not true, as the following problem shows.)
24. Let x have the marginal density $f_1(x) = 1$, $-\tfrac{1}{2} < x < \tfrac{1}{2}$, and
let the conditional density of y be

$$f(y|x) = \begin{cases} 1 & x < y < x + 1,\ -\tfrac{1}{2} < x < 0 \\ 1 & -x < y < 1 - x,\ 0 < x < \tfrac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$


Find the correlation between x and y. 
25. Could the function E[1/(1 + tx)] be used to generate the
moments of a variate x?




CHAPTER 6 
SPECIAL CONTINUOUS DISTRIBUTIONS 


6.1. Uniform Distribution. The simplest distribution for a
continuous variate is the uniform density:

$$f(x) = \begin{cases} \dfrac{1}{\beta - \alpha} & \alpha < x < \beta \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

which is plotted in Fig. 28.

Fig. 28.

The probability that an observation will fall in any interval within
$\alpha < x < \beta$ is equal to $1/(\beta - \alpha)$ times the length of the interval.
The distribution is particularly useful in theoretical statistics because
it is convenient to deal with mathematically. The following theorem
enables us to deal only with this simple distribution when discussing
certain properties of distributions in general:

Any density for a continuous variate x may be transformed to the
uniform density

$$f(y) = 1 \qquad 0 < y < 1 \tag{2}$$

by letting y = G(x), where G(x) is the cumulative distribution of x.

It is clear that y must have range zero to one since a cumulative
distribution must vary between zero and one. We need only show that
the density of y is f(y) = 1 over that range. Now a value of y is
determined by drawing a value of x, say $x_0$, and substituting in G(x)
to get a corresponding $y_0 = G(x_0)$. The transformation y = G(x) sets
up a correspondence between points of the x axis and points on the
interval (0, 1) on the y axis. To find the probability that y lies in an
interval, say a < y < b, we find the values, say a' and b', on the x axis
which correspond to a and b, as in Fig. 29, and compute the probability
for that interval (a', b') in terms of x. Thus,

$$P(a < y < b) = G(b') - G(a')$$

but by definition G(b') = b and G(a') = a; hence

$$P(a < y < b) = b - a \qquad 0 < a < b < 1$$

Fig. 29.

Suppose we denote the cumulative distribution of y by F(y); then

$$F(b) - F(a) = b - a$$

and replacing b by $y + \Delta y$ and a by y, we get

$$\frac{F(y + \Delta y) - F(y)}{\Delta y} = 1$$

The limit of the expression on the left as $\Delta y$ approaches zero gives the
derivative of the cumulative distribution, which is the density we seek:

$$f(y) = \frac{d}{dy} F(y) = \lim_{\Delta y \to 0} \frac{F(y + \Delta y) - F(y)}{\Delta y} = 1 \qquad 0 < y < 1$$

which proves that y has the density (2). The transformation y = G(x)
is called the probability transformation.

By means of this theorem it is possible to demonstrate many prop- 
erties of continuous distributions in general by proving them merely 
for the uniform distribution over the unit interval. 
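The probability transformation can be illustrated by simulation. The sketch below is a minimal example under stated assumptions: the exponential density is chosen here only because its cumulative distribution has a simple closed form, and the function names, sample size, and seed are arbitrary choices for the illustration.

```python
import math
import random


def exponential_cdf(x, rate=1.0):
    """G(x) = 1 - e^(-rate * x), the cumulative distribution of an exponential variate."""
    return 1.0 - math.exp(-rate * x)


def transformed_sample(n, seed=0):
    """Draw n exponential values and map each through its own cumulative distribution."""
    rng = random.Random(seed)
    return [exponential_cdf(rng.expovariate(1.0)) for _ in range(n)]


ys = transformed_sample(20000)
mean_y = sum(ys) / len(ys)
frac_below_quarter = sum(y < 0.25 for y in ys) / len(ys)
print(mean_y, frac_below_quarter)
```

If y is uniform over the unit interval, the sample mean should be near one-half and about a quarter of the values should fall below .25, up to sampling fluctuation.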

6.2. The Normal Distribution. A great many of the techniques
used in applied statistics are based upon the normal distribution, and
much of the remainder of this course will be devoted to a study of this
distribution. The density is

$$n(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2/2\sigma^2} \qquad -\infty < x < \infty \tag{1}$$

and the function is plotted in Fig. 30 for several values of $\sigma$. Changing
$\mu$ merely shifts the curves to the right or left without changing their
shape. The function given actually represents a two-parameter family
of distributions, the parameters being $\mu$ and $\sigma^2$. We have used the
symbols $\mu$ and $\sigma^2$ to represent the parameters because the parameters
turn out, as we shall see, to be the mean and variance, respectively, of
the distribution.


Fig. 30.
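Since n(x) is given to be a density function, it is implied that

$$\int_{-\infty}^{\infty} n(x)\,dx = 1$$

but we should satisfy ourselves that this is true. The verification is
somewhat troublesome because this particular function does not
integrate into a simple closed expression. Suppose we represent the area
under the curve by A; then

$$A = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(x - \mu)^2/2\sigma^2}\,dx$$

and on making the substitution

$$y = \frac{x - \mu}{\sigma}$$

we find

$$A = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy$$

We wish to show that A = 1, and this is most easily done by showing
$A^2$ is one and then reasoning that A = 1, since n(x) is positive. We
may put

$$A^2 = \left(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy\right)\left(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2}\,dz\right) = \frac{1}{2\pi} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(y^2 + z^2)/2}\,dy\,dz$$

writing the product of two integrals as a double integral. In this
integral we change the variables to polar coordinates by the substitution

$$y = r \sin\theta$$
$$z = r \cos\theta$$

and the integral becomes

$$A^2 = \frac{1}{2\pi} \int_0^{\infty}\int_0^{2\pi} r e^{-r^2/2}\,d\theta\,dr = \int_0^{\infty} r e^{-r^2/2}\,dr = 1$$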
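The unit area can also be confirmed numerically. The sketch below is an illustration only: it uses a composite Simpson rule over ten standard deviations on either side of the mean, and the parameter values and function names are assumptions made for the example.

```python
import math


def normal_density(x, mu, sigma):
    """The normal density n(x) of equation (1)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


mu, sigma = 2.0, 4.0  # illustrative parameter values
area = simpson(lambda x: normal_density(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma)
print(area)
```

The area beyond ten standard deviations is negligible, so the truncated integral should agree with one to many decimal places.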


Since the integral of n(x) does not have a simple functional form, we
can only exhibit the cumulative distribution formally as

$$N(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-(t - \mu)^2/2\sigma^2}\,dt \tag{2}$$

and if we let

$$y = \frac{t - \mu}{\sigma}$$

we find

$$N(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(x - \mu)/\sigma} e^{-y^2/2}\,dy \tag{3}$$

and given a specific value for $(x - \mu)/\sigma$, the integral can be computed
by numerical methods. A tabulation of this function may be found
in Table II. Since the density is symmetric about $\mu$, i.e., since

$$n(\mu - a) = n(\mu + a)$$

it follows that N(x) for $(x - \mu)/\sigma$ negative is equal to $1 - N(x')$, where
$(x' - \mu)/\sigma = -(x - \mu)/\sigma$. The graph of N(x) is given in Fig. 31.
To illustrate the use of the table, we shall find P(-1 < x < 4)
when x has the density:

$$n(x) = \frac{1}{4\sqrt{2\pi}} e^{-(x - 2)^2/32} \tag{4}$$

We note that

$$\mu = 2 \qquad \sigma = 4$$

and thus that the values of $(x - \mu)/\sigma$ corresponding to -1 and 4 are

$$\frac{-1 - 2}{4} = -\frac{3}{4} \qquad \frac{4 - 2}{4} = \frac{1}{2}$$

Fig. 31.

$$P(-1 < x < 4) = N(4) - N(-1)$$
$$= .6915 - (1 - .7734)$$
$$= .4649$$
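The table lookup can be reproduced with the error function from the standard library. This is a sketch, not the book's method; the function name `normal_cdf` is an assumption, and the identity used is the standard relation between the normal cumulative distribution and the error function.

```python
import math


def normal_cdf(x, mu, sigma):
    """N(x; mu, sigma^2), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# P(-1 < x < 4) for the density (4), which has mean 2 and standard deviation 4.
p = normal_cdf(4, 2, 4) - normal_cdf(-1, 2, 4)
print(round(p, 4))
```

The computed value is about .4648; the small difference from .4649 arises because the table entries are themselves rounded to four places.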


It is a great convenience that N(x) is of such a form that it need not
be tabulated for various combinations of values of $\mu$ and $\sigma$. The
transformation $y = (x - \mu)/\sigma$ brings all normal distributions to the same
form, called the standard or normalized form. We shall reserve the
letters n and N henceforth to indicate the normal density and its
cumulative form. Often we shall wish to indicate the parameters, and
this will be done by writing the functions as $n(x; \mu, \sigma^2)$ and $N(x; \mu, \sigma^2)$,
separating the parameters from the variate by a semicolon. In this
notation the distribution (4) would be symbolized by n(x; 2, 16). The
standard normal distribution is then

$$n(x; 0, 1) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \tag{5}$$

and its cumulative form is

$$N(x; 0, 1) = \int_{-\infty}^{x} n(t; 0, 1)\,dt \tag{6}$$

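We shall now find the moments of $n(x; \mu, \sigma^2)$ by finding first the
moment generating function. The computation is as follows:

$$m(t) = E(e^{tx}) = e^{t\mu} E(e^{t(x - \mu)})$$
$$= e^{t\mu} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{t(x - \mu)} e^{-(x - \mu)^2/2\sigma^2}\,dx$$
$$= e^{t\mu} \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(1/2\sigma^2)[(x - \mu)^2 - 2\sigma^2 t(x - \mu)]}\,dx$$

On completing the square inside the bracket, it becomes

$$(x - \mu)^2 - 2\sigma^2 t(x - \mu) = (x - \mu)^2 - 2\sigma^2 t(x - \mu) + \sigma^4 t^2 - \sigma^4 t^2$$
$$= (x - \mu - \sigma^2 t)^2 - \sigma^4 t^2$$

and we have

$$m(t) = e^{t\mu + \sigma^2 t^2/2} \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(x - \mu - \sigma^2 t)^2/2\sigma^2}\,dx$$

The integral together with the factor $1/\sqrt{2\pi}\,\sigma$ is necessarily one, since
it is the area under a normal distribution with mean $\mu + \sigma^2 t$ and
variance $\sigma^2$. Hence,

$$m(t) = e^{t\mu + \sigma^2 t^2/2} \tag{7}$$

On differentiating this function twice and substituting t = 0 in the
results, we find

$$\mu_1' = \mu$$
$$\mu_2' = \sigma^2 + \mu^2$$
$$\text{Variance} = \mu_2' - (\mu_1')^2 = \sigma^2$$

thus justifying our use of the moment symbols for the parameters.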
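A numerical check of these two derivatives is sketched below. It is an illustration only: the derivatives of the moment generating function are estimated by central differences instead of analytically, and the parameter values, step size, and function names are assumptions.

```python
import math


def normal_mgf(t, mu, sigma):
    """m(t) = exp(t*mu + sigma^2 t^2 / 2), equation (7)."""
    return math.exp(t * mu + 0.5 * sigma ** 2 * t ** 2)


def first_two_moments(m, h=1e-4):
    """Central-difference estimates of m'(0) and m''(0)."""
    mu1 = (m(h) - m(-h)) / (2 * h)
    mu2 = (m(h) - 2 * m(0.0) + m(-h)) / h ** 2
    return mu1, mu2


mu, sigma = 2.0, 4.0  # illustrative parameter values
mu1, mu2 = first_two_moments(lambda t: normal_mgf(t, mu, sigma))
variance = mu2 - mu1 ** 2
print(mu1, mu2, variance)
```

With these values the estimates should be close to 2, 20, and 16 respectively.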
6.3. The Gamma Distribution. The function

$$f(x) = \begin{cases} \dfrac{1}{\alpha!\,\beta^{\alpha + 1}} x^{\alpha} e^{-x/\beta} & 0 < x < \infty \\ 0 & \text{elsewhere} \end{cases} \tag{1}$$

is called the gamma distribution. This is a two-parameter family of
distributions, the parameters being $\alpha$ and $\beta$. $\beta$ must be positive, and
$\alpha$ must be greater than minus one. The function is plotted in Fig. 32
for $\beta = 1$ and several values of $\alpha$. Changing $\beta$ merely changes the
scale on the two axes, as is evident on examining the form of the
function.

To show that the function represents a density (has unit area), we
shall evaluate the integral

$$A = \int_0^{\infty} \frac{1}{\beta^{\alpha + 1}} x^{\alpha} e^{-x/\beta}\,dx = \int_0^{\infty} y^{\alpha} e^{-y}\,dy$$

on substituting y for $x/\beta$; hence A is necessarily a function of $\alpha$ only.

Fig. 32.

If $\alpha > 0$, we may integrate at once by parts to obtain

$$A(\alpha) = \Big[-y^{\alpha} e^{-y}\Big]_0^{\infty} + \alpha \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy = \alpha \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy$$

whence it follows that

$$A(\alpha) = \alpha A(\alpha - 1) \tag{2}$$

If $\alpha$ is a positive integer, we may apply this recurrence formula (2)
successively to obtain

$$A(\alpha) = \alpha(\alpha - 1)(\alpha - 2) \cdots (2)(1)A(0)$$

and since

$$A(0) = \int_0^{\infty} e^{-y}\,dy = 1$$


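we have

$$A(\alpha) = \alpha!$$

when $\alpha$ is an integer. The function $A(\alpha)$ is often denoted by $\Gamma(\alpha + 1)$
in mathematical literature, but we shall use the symbol $\alpha!$ whether or
not $\alpha$ is an integer.

In practically all applications of the distribution, $\alpha$ is either an
integer or a multiple of one-half. Hence for our purposes we need
only to evaluate $(\tfrac{1}{2})!$ in order to be able to compute $\alpha!$ for any value
of $\alpha$ we may encounter.

$$(\tfrac{1}{2})! = \tfrac{1}{2}(-\tfrac{1}{2})! = \tfrac{1}{2} \int_0^{\infty} y^{-1/2} e^{-y}\,dy$$

and if we let $y = z^2/2$, we have

$$(\tfrac{1}{2})! = \frac{\sqrt{2}}{2} \int_0^{\infty} e^{-z^2/2}\,dz = \sqrt{\pi}\,\frac{1}{\sqrt{2\pi}} \int_0^{\infty} e^{-z^2/2}\,dz = \frac{\sqrt{\pi}}{2}$$

since the integral is half the area under a normal density function and
is therefore one-half. Knowing this number, we can evaluate $\alpha!$ for
any multiple of one-half by using the relation (2); thus

$$(\tfrac{5}{2})! = \tfrac{5}{2}(\tfrac{3}{2})! = \tfrac{5}{2} \times \tfrac{3}{2}(\tfrac{1}{2})! = \frac{15\sqrt{\pi}}{8}$$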
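These half-integer factorials can be checked against the gamma function in Python's standard library, using the relation $\alpha! = \Gamma(\alpha + 1)$ noted above. The helper name `fact` is an assumption introduced for this sketch.

```python
import math


def fact(a):
    """a! in the text's extended sense, i.e. Gamma(a + 1), for a > -1."""
    return math.gamma(a + 1.0)


half = fact(0.5)            # should equal sqrt(pi) / 2
five_halves = fact(2.5)     # should equal 15 sqrt(pi) / 8
# The recurrence (2): a! = a * (a - 1)!
recurrence_ok = math.isclose(fact(2.5), 2.5 * fact(1.5))
print(half, five_halves, recurrence_ok)
```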


The cumulative distribution is

$$F(x) = \int_0^{x} \frac{1}{\alpha!\,\beta^{\alpha + 1}} t^{\alpha} e^{-t/\beta}\,dt \qquad x > 0 \tag{3}$$

and is, of course, zero when x < 0. It must be evaluated by numerical
methods unless $\alpha$ is a positive integer, in which case the function can
be found by successive integrations by parts to be

$$F(x) = 1 - \left[1 + \frac{x}{\beta} + \frac{1}{2!}\left(\frac{x}{\beta}\right)^2 + \cdots + \frac{1}{\alpha!}\left(\frac{x}{\beta}\right)^{\alpha}\right] e^{-x/\beta} \qquad x > 0 \tag{4}$$

But in any case it is usually simpler to refer to tables of the function in
dealing with specific problems. The function F(x) is called the
incomplete gamma function and has been extensively tabulated by Karl
Pearson ("Tables of the Incomplete Gamma Function," Cambridge
University Press, London, 1922).
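For integer $\alpha$, the closed form (4) can be compared with a direct numerical evaluation of the integral (3). The sketch below is illustrative: the midpoint rule, the function names, and the parameter values are assumptions, and the closed form is used only for nonnegative integer $\alpha$.

```python
import math


def gamma_density(x, alpha, beta):
    """Density (1) of the gamma distribution for integer alpha >= 0 and beta > 0."""
    return x ** alpha * math.exp(-x / beta) / (math.factorial(alpha) * beta ** (alpha + 1))


def gamma_cdf_closed(x, alpha, beta):
    """Equation (4): valid when alpha is a nonnegative integer and x > 0."""
    u = x / beta
    partial = sum(u ** k / math.factorial(k) for k in range(alpha + 1))
    return 1.0 - partial * math.exp(-u)


def gamma_cdf_numeric(x, alpha, beta, n=20000):
    """Midpoint-rule approximation to the integral (3)."""
    h = x / n
    return h * sum(gamma_density((i + 0.5) * h, alpha, beta) for i in range(n))


closed = gamma_cdf_closed(3.0, 2, 1.5)
numeric = gamma_cdf_numeric(3.0, 2, 1.5)
print(closed, numeric)
```

The two values should agree to several decimal places, which gives some reassurance that (4) has been transcribed correctly.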

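The moment generating function for this distribution is

$$m(t) = \int_0^{\infty} \frac{1}{\alpha!\,\beta^{\alpha + 1}} e^{tx} x^{\alpha} e^{-x/\beta}\,dx = \frac{1}{\alpha!} \int_0^{\infty} e^{t\beta y} y^{\alpha} e^{-y}\,dy$$

on substituting y for $x/\beta$. This may then be put in the form:

$$m(t) = \frac{1}{\alpha!} \int_0^{\infty} y^{\alpha} e^{-y(1 - \beta t)}\,dy = \frac{1}{(1 - \beta t)^{\alpha + 1}} \int_0^{\infty} \frac{1}{\alpha!\,\beta'^{\,\alpha + 1}} y^{\alpha} e^{-y/\beta'}\,dy = \frac{1}{(1 - \beta t)^{\alpha + 1}} \tag{5}$$

provided $t < 1/\beta$, since the last integral represents the area under a
gamma distribution with parameters $\alpha$ and $\beta' = 1/(1 - \beta t)$, and is
therefore one. On differentiating m(t) twice and putting t = 0 in
the results, we find

$$\mu_1' = \beta(\alpha + 1) \tag{6}$$
$$\mu_2' = \beta^2(\alpha + 1)(\alpha + 2) \tag{7}$$
$$\sigma^2 = \beta^2(\alpha + 1) \tag{8}$$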
6.4. The Beta Distribution. The density

$$f(x) = \begin{cases} \dfrac{(\alpha + \beta + 1)!}{\alpha!\,\beta!} x^{\alpha}(1 - x)^{\beta} & 0 < x < 1 \\ 0 & \text{elsewhere} \end{cases} \tag{1}$$

is called the beta density. The function represents a two-parameter
family of distributions, and a few examples are plotted in Fig. 33.
The parameters $\alpha$ and $\beta$ must both be greater than minus one. The
distribution becomes the uniform distribution over the unit interval
when $\alpha = \beta = 0$.

To show that the area under f(x) is one, we shall compute the integral

$$A(\alpha, \beta) = \int_0^1 x^{\alpha}(1 - x)^{\beta}\,dx \tag{2}$$


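Clearly A will be a function of $\alpha$ and $\beta$; we wish to show that it is the
reciprocal of the constant multiplier in (1). Referring back to the
gamma distribution, we may write

$$\alpha!\,\beta! = \left(\int_0^{\infty} x^{\alpha} e^{-x}\,dx\right)\left(\int_0^{\infty} y^{\beta} e^{-y}\,dy\right) = \int_0^{\infty}\int_0^{\infty} x^{\alpha} y^{\beta} e^{-(x + y)}\,dx\,dy$$

Fig. 33.

and in this last integral we shall change the variable x to u by the
substitution

$$u = \frac{x}{x + y}$$

or

$$x = \frac{uy}{1 - u} \qquad dx = \frac{y\,du}{(1 - u)^2}$$

Since u obviously has the range zero to one, the integral becomes

$$\alpha!\,\beta! = \int_0^{\infty}\int_0^1 \frac{u^{\alpha} y^{\alpha + \beta + 1}}{(1 - u)^{\alpha + 2}} e^{-y/(1 - u)}\,du\,dy$$

In this integral we change y to v by the substitution

$$y = (1 - u)v \qquad dy = (1 - u)\,dv$$

to get

$$\alpha!\,\beta! = \int_0^{\infty}\int_0^1 v^{\alpha + \beta + 1} e^{-v} u^{\alpha}(1 - u)^{\beta}\,du\,dv$$
$$= \left(\int_0^{\infty} v^{\alpha + \beta + 1} e^{-v}\,dv\right)\left(\int_0^1 u^{\alpha}(1 - u)^{\beta}\,du\right)$$
$$= (\alpha + \beta + 1)! \int_0^1 u^{\alpha}(1 - u)^{\beta}\,du$$

which shows that $A(\alpha, \beta)$ has the stated value. $A(\alpha - 1, \beta - 1)$ is
called the beta function of $\alpha$ and $\beta$ in the literature and is usually
denoted by $B(\alpha, \beta)$.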

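The cumulative distribution, often called the incomplete beta
function, is

$$F(x) = \begin{cases} 0 & x \le 0 \\ \displaystyle\int_0^{x} \frac{(\alpha + \beta + 1)!}{\alpha!\,\beta!} t^{\alpha}(1 - t)^{\beta}\,dt & 0 < x < 1 \\ 1 & x \ge 1 \end{cases} \tag{3}$$

and has also been extensively tabulated by Karl Pearson ("Tables of
the Incomplete Beta Function," Cambridge University Press, London,
1932).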

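The moment generating function for this distribution does not have a
simple form, but the moments are readily found directly:

$$\mu_r' = E(x^r) = \frac{(\alpha + \beta + 1)!}{\alpha!\,\beta!} \int_0^1 x^{r + \alpha}(1 - x)^{\beta}\,dx$$
$$= \frac{(\alpha + \beta + 1)!\,(\alpha + r)!}{(\alpha + \beta + r + 1)!\,\alpha!} \int_0^1 \frac{(\alpha + \beta + r + 1)!}{(\alpha + r)!\,\beta!} x^{r + \alpha}(1 - x)^{\beta}\,dx$$
$$= \frac{(\alpha + \beta + 1)!\,(\alpha + r)!}{(\alpha + \beta + r + 1)!\,\alpha!}$$

since the integral must be one.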
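For integer parameters the moment formula can be evaluated exactly. The sketch below is an illustration with assumed names and parameter values; it uses ordinary factorials, so it covers only nonnegative integer $\alpha$ and $\beta$, whereas the formula in the text holds more generally.

```python
from fractions import Fraction
from math import factorial


def beta_moment(r, alpha, beta):
    """r-th moment about the origin of the beta density, for integer alpha, beta >= 0."""
    return Fraction(
        factorial(alpha + beta + 1) * factorial(alpha + r),
        factorial(alpha + beta + r + 1) * factorial(alpha),
    )


mean = beta_moment(1, 2, 3)                  # (alpha + 1) / (alpha + beta + 2)
variance = beta_moment(2, 2, 3) - mean ** 2
uniform_variance = beta_moment(2, 0, 0) - beta_moment(1, 0, 0) ** 2
print(mean, variance, uniform_variance)
```

With $\alpha = \beta = 0$ the density is uniform over the unit interval, and the variance reduces to the familiar 1/12.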


6.5. Other Distribution Functions. A distribution which we shall
find useful for illustrative purposes is the Cauchy density

$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + (x - \mu)^2} \qquad -\infty < x < \infty \tag{1}$$

which has a mean only in a restricted sense, and no higher moments.
The cumulative distribution is

$$F(x) = \int_{-\infty}^{x} \frac{dt}{\pi[1 + (t - \mu)^2]} = \frac{1}{\pi} \arctan(t - \mu) \Big|_{-\infty}^{x} = \frac{1}{2} + \frac{1}{\pi} \arctan(x - \mu) \tag{2}$$
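The cumulative form (2) is simple enough to evaluate directly. In the sketch below the function name and the value of $\mu$ are assumptions; it shows that the median falls at $\mu$ and the quartiles at $\mu \pm 1$, which gives a sense of the spread of a distribution that has no variance.

```python
import math


def cauchy_cdf(x, mu=0.0):
    """F(x) = 1/2 + (1/pi) arctan(x - mu), equation (2)."""
    return 0.5 + math.atan(x - mu) / math.pi


mu = 3.0  # illustrative location parameter
print(cauchy_cdf(mu - 1, mu), cauchy_cdf(mu, mu), cauchy_cdf(mu + 1, mu))
```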


Pearson's Distributions. A general class of distribution functions is
given by the families of solutions of the differential equation

$$\frac{dy}{dx} = \frac{(x + a)y}{bx^2 + cx + d} \tag{3}$$

The equation was obtained by Karl Pearson by putting dy/dx equal
to the slope of a straight line joining two successive points of the
discrete hypergeometric distribution. The solutions of this equation
were classified by Pearson into twelve families of curves, those of one
family being called Type I curves, those of a second Type II, and so
on. The gamma distributions are essentially the Type III curves of
Pearson; the normal distributions are his Type VII curves; the beta
distributions represent his Type I curves, while with $\alpha = \beta$ they
represent his Type II curves.

The different families of curves arise when different relations are
assumed between the constants a, b, c, d in the differential equation.
Thus, for example, when b and c are zero, the equation becomes

$$\frac{dy}{dx} = \frac{(x + a)y}{d}$$

and its solution is

$$\log y = \frac{(x + a)^2}{2d} + K$$

or

$$y = k e^{(x + a)^2/2d}$$

which becomes the normal density when d is taken to be negative and k
is determined so as to make the area under the curve equal to one. By
considering various other conditions on the constants in (3), we could
derive all twelve of Pearson's types of curves, but we shall not develop
these because most of them have not proved to be of great importance
in statistics.

The Gram-Charlier Series. A wide class of density functions may
be represented by an infinite series called the Gram-Charlier series.
Suppose f(x) is a density function and suppose its mean and variance
are $\mu$ and $\sigma^2$. Let

$$y = \frac{x - \mu}{\sigma}$$

then y has zero mean and unit variance. The Gram-Charlier series is
a series in the derivatives of the normal distribution of y. Let $n_i(y)$
represent the ith derivative of the standard normal density n(y; 0, 1).
Thus

$$n_0(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}$$
$$n_1(y) = -y\,n_0(y)$$
$$n_2(y) = (y^2 - 1)\,n_0(y)$$
$$n_3(y) = -(y^3 - 3y)\,n_0(y)$$

and in general

$$n_i(y) = H_i(y)\,n_0(y)$$

where $H_i(y)$ is a polynomial of degree i in y called the ith Hermite
polynomial. The Gram-Charlier theorem states that under rather
general conditions f(x) may be put in the form

$$f(x) = \alpha_0 n_0(y) + \alpha_1 n_1(y) + \alpha_2 n_2(y) + \cdots = \sum_{i=0}^{\infty} \alpha_i n_i(y) = n_0(y) \sum_{i=0}^{\infty} \alpha_i H_i(y) \tag{4}$$

where the $\alpha_i$ are constants and $y = (x - \mu)/\sigma$. It can be shown that

$$H_i(y) = (-1)^i \left[y^i - \frac{i(i - 1)}{2} y^{i - 2} + \frac{i(i - 1)(i - 2)(i - 3)}{2 \times 4} y^{i - 4} - \cdots\right] \tag{5}$$

$$\int_{-\infty}^{\infty} H_i(y) H_j(y) n_0(y)\,dy = \begin{cases} 0 & \text{if } i \ne j \\ i! & \text{if } i = j \end{cases} \tag{6}$$

We shall not prove these relations. By means of the second one we
may determine the coefficients $\alpha_i$ when f(x) is known and can be
expressed by (4). Let equation (4) be multiplied by $H_j(y)$ and then
integrated on both sides with respect to x after putting $y = (x - \mu)/\sigma$.
We find

$$\int_{-\infty}^{\infty} H_j(y) f(x)\,dx = \alpha_j\,j!$$

on applying (6), and hence that

$$\alpha_j = \frac{1}{j!} \int_{-\infty}^{\infty} H_j\!\left(\frac{x - \mu}{\sigma}\right) f(x)\,dx \tag{7}$$

Since the $H_i[(x - \mu)/\sigma]$ are polynomials in $(x - \mu)$, the $\alpha_i$ will be linear
functions of the moments of x about the mean.
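The polynomials $H_i(y)$ can be generated mechanically. Differentiating $n_i(y) = H_i(y)n_0(y)$ and using $n_0'(y) = -y\,n_0(y)$ gives the recurrence $H_{i+1}(y) = H_i'(y) - y\,H_i(y)$; this recurrence is a consequence of the definitions above rather than a formula quoted from the text. The sketch below applies it to coefficient lists; the function names and the list representation (lowest power first) are assumptions.

```python
def next_hermite(coeffs):
    """Given coefficients of H_i (lowest power first), return those of H_{i+1}.

    Uses H_{i+1}(y) = H_i'(y) - y * H_i(y), which follows from differentiating
    n_i(y) = H_i(y) n_0(y) and using n_0'(y) = -y n_0(y).
    """
    derivative = [k * c for k, c in enumerate(coeffs)][1:]
    shifted = [0] + list(coeffs)  # coefficients of y * H_i(y)
    out = [0] * len(shifted)
    for k, c in enumerate(derivative):
        out[k] += c
    for k, c in enumerate(shifted):
        out[k] -= c
    return out


def hermite(i):
    """Coefficients of H_i(y), lowest power first, starting from H_0(y) = 1."""
    coeffs = [1]
    for _ in range(i):
        coeffs = next_hermite(coeffs)
    return coeffs


for i in range(5):
    print(i, hermite(i))
```

The first few outputs reproduce the polynomials listed above, including the alternating sign that comes from defining $H_i$ through successive derivatives of the normal density.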

The Pearson curves and the Gram-Charlier series were devised to
meet the following practical problem: In general f(x) is unknown, and
all that is available is a sample of values of x. By means of the
sample, the moments of f(x) can be estimated. A Pearson curve
which is intended to approximate f(x) may be fitted to the sample by
equating the sample moments to the theoretical moments and solving
for the parameters which appear in the theoretical moments. These
values of the parameters are then substituted in the function to obtain
a specific function which is meant to approximate f(x). Similarly,
the estimated moments may be used to determine a set
of values of $\alpha_i$ which, when substituted in (4), gives an approximation
to f(x); in this method only the first few terms of the infinite series are
used.

Actually the process of fitting a smooth curve to a sample does not 
add anything to our information about f(x) that is not contained in the 
sample. The fitted curve may, in fact, give one an entirely misleading 
impression of the real density function. However, when the sample is 
quite large, it is sometimes convenient to replace the data by some sort 
of fitted curve in order to simplify further computations. Insurance 
companies and certain government agencies which deal with large 
masses of data find the technique convenient. 


6.6. Problems 


1. Find and plot the cumulative form for the uniform distribution.
2. What transformation will change the variate x to one which 
will have the uniform distribution over the unit interval if 


fe) - 25» 


1 < x < 3? What interval for the new variate corresponds to
1.1 < x < 2.9?

3. Plot n(x; 0, .25), n(x; 1, .25), and n(x; 1, 9) on the same graph.
What would be the appearance of the distribution if $\sigma$ were very small?
(Use Table I.)

4. If x is normally distributed with unit mean and $\sigma = 4$, find
P(x > 0) and P(.2 < x < 1.8).

5. Find the number k such that for a normally distributed variate,
$P(\mu - k\sigma < x < \mu + k\sigma) = .95$. What would k be if P = .90? .99?
For what value of k is $P(x > \mu - k\sigma) = .95$?

6. Find the generating function $E(e^{t(x - \mu)})$ for the moments about
the mean for a normal distribution.

7. Find $\mu_r$ in terms of $\sigma$ for a normal distribution for r even and r
odd. (Expand the above generating function in an infinite series.)

8. What constant multiplier will change the function e+= into a 
density function? What are the mean and the variance of the result- 
ing distribution? 


9. Evaluate ii s e* dz. 


10. Evaluate [A P qe dy. 


11. Plot the gamma density for $\alpha = 1$, $\beta = 1$; $\alpha = 1$, $\beta = 2$; $\alpha = 2$,
$\beta = 1$; $\alpha = 4$, $\beta = 1$.

12. Find the third moment, $\mu_3'$, of the gamma distribution.

13. If in the gamma distribution $\beta$ is put equal to 2 and $\alpha$ is put equal
to (n - 2)/2, the resulting distribution is called the chi-square
distribution with n degrees of freedom. Find its moment generating
function and its mean and variance.

14. Find k such that P(x > k) = .05 for the chi-square distribution 
with two degrees of freedom. 

15. Find the rth moment of the gamma distribution without using 
the moment generating function. 

16. Find the rth moment of the gamma distribution using the
generating function.

17. Plot the beta density for $\alpha = 0$, $\beta = 0$; $\alpha = 1$, $\beta = 1$; $\alpha = 3$,
$\beta = 3$; $\alpha = 2$, $\beta = 3$; $\alpha = 3$, $\beta = 2$. What would be the appearance
of the function if both $\alpha$ and $\beta$ were large?

18. Find the mean and variance of the beta distribution.

19. Show that the beta density is symmetric about the point $x = \tfrac{1}{2}$
when $\alpha = \beta$.


o 


n a oe a de 
20. Find the mean of the Cauchy distribution if = [ se Ga? 


is defined to be 
li RS 1 x dx 
FE, P E a E E 


Show that the distribution does not have any higher moments. 


21. Integrate Pearson’s differential equation when c and d equal 
zero. What family of distributions does the result represent? 

22. Show that any Gram-Charlier expansion must have $\alpha_0 = 1$,
$\alpha_1 = 0$, $\alpha_2 = 0$.

23. Evaluate $\alpha_i$ for the Gram-Charlier expansion of f(x) = 1,
0 < x < 1. Plot f(x) and plot

$$f_4(x) = n_0(y) \sum_{i=0}^{4} \alpha_i H_i(y) \qquad y = \frac{x - \mu}{\sigma}$$

in order to see how the sum of the first few terms of the Gram-Charlier
series begins to approximate f(x).

24. Compare the Cauchy density and the normal density with 
v = 2 by plotting them on the same graph both with mean zero. 
Notice that the variance is a poor criterion for comparing two distribu- 
tions unless it is known that they have the same functional form. 

25. What are the cumulants of the normal distribution? 

26. Let x have the gamma distribution with parameters $\alpha = 10$,
$\beta = 1$. How many moments does y = 1/x have?

27. If x has the gamma distribution, find the moment generating
function of y = log x.

28. A variate x has the density 


fe) = 24/2 x2 2 >0 


Find its mean and variance. 

29. A variate has moments $\mu_r' = r!$. Find its moment generating
function and then deduce its distribution.

30. A variate x has the uniform distribution over the unit interval;
what function of x has the gamma distribution with $\alpha = 0$, $\beta = 1$?

31. A variate x has the beta distribution with $\alpha = 0$, $\beta = 1$. What
function of x has the gamma distribution with $\alpha = 0$, $\beta = 1$?


; ! ; 
32. A variate has moments u? = v when r is even and p = 0 


whenrisodd. Deduce the distribution of the variate from its moment. 
generating function. 

33. Show how tables of the incomplete gamma function F(x; a, 8) 
may be used to evaluate the cumulative Poisson distribution, say, 


34. If log x is normally distributed with $\mu = 1$, $\sigma^2 = 4$, find


POS STL?) 
(log 2 = .693) . 
35. A variate x has the density 


{@)= 24 se z>0 
Find P(x < 4). 
36. Determine the mean and variance of the normal distribution
by differentiating the identity

$$\int_{-\infty}^{\infty} n(x; \mu, \sigma^2)\,dx = 1$$

with respect to $\mu$ and with respect to $\sigma^2$.

37. A variate x is said to be transformed to standard scale if it is
divided by its standard deviation. Show that the cumulants of $x/\sigma$
are equal to $\gamma_r/\gamma_2^{r/2}$, where $\gamma_r$ is the rth cumulant of x.

38. Show that the gamma distribution is nearly normal when $\alpha$ is
large, by comparing the cumulants of the two distributions on standard
scale.

39. A variate x is normally distributed with mean $\mu$ and variance $\sigma^2$.
Show that the mean of the conditional distribution of x, given
a < x < b, is

$$\mu + \sigma^2\,\frac{n(a) - n(b)}{N(b) - N(a)}$$


40. A variate x has density f(x). How might one determine a 
function u(x) such that u is distributed by g(u)? 




CHAPTER 7 
SAMPLING 


7.1. Inductive Inference. Up to now we have been concerned with 
certain aspects of the theory of probability. The subject of sampling 
brings us to the theory of statistics proper, and we shall consider 
briefly here one important area of the theory of statistics and its rela- 
tion to sampling. 

Progress in science is ascribed to experimentation. The research 
worker performs an experiment and obtains some data. On the basis 
of the data certain conclusions are drawn. The conclusions usually 
go beyond the materials and operations of the particular experiment. 
In other words, the scientist may generalize from a particular experi- 
ment to the class of all similar experiments. This sort of extension 
from the particular to the general is called inductive inference. It is 
the way in which new knowledge is found. 

Inductive inference is well known to be a hazardous process. In 
fact, it is a theorem of logic that exact inductive inference is impossible. 
One simply cannot make a perfectly valid generalization. However, 
uncertain inferences can be made, and the degree of uncertainty can 
be measured if the experiment has been performed in accordance with 
certain principles. One function of statistics is the provision of 
techniques for making inductive inferences and for measuring the 
degree of uncertainty of such inferences. Uncertainty is measured in 
terms of probability, and that is the reason we have devoted so much 
time to the theory of probability. 

Let us consider a particular experiment to make the above ideas 
somewhat more concrete. Suppose a nutritionist studying a vitamin 
deficiency wishes to discover the effect of a certain diet. He selects, 
say, ten individuals and gives them the diet for a number of days or 
weeks. And let us suppose that the diet plainly affected all the indi- 
viduals as reflected by some measurable criterion, say loss of weight or 
decreased metabolism. The nutritionist is not interested in confining 
his conclusions to this particular group of individuals. He would like 
to conclude that all or at least a large proportion of all individuals 
would react similarly to the diet. 



It is clear that no certain generalization is possible. It is conceiv- 
able, for example, that the nutritionist was unfortunate enough to have 
selected individuals who happened to be physically on the downgrade 
at the time, so that the apparent results of the experiment were not in 
fact a consequence of the diet. Or the individuals may have been 
exposed to some minor malady which was not recognized. Some item 
of food in the diet may have been tainted. In fact, one could list a 
great many accidental circumstances which could have produced the 
observed results quite independently of the diet. Whatever general- 
ization is made must be an uncertain one. 

To complete the discussion, we shall consider one very simple kind 
of inference that may be made. Let us suppose that the individuals 
were selected from some large group of individuals, say the inhabitants 
of a county or state. We may envisage the possibility that there is 
some proportion p of the individuals in the large group which will be 
adversely affected by the diet and that the remaining proportion 
$q = 1 - p$ will be favorably affected or unaffected by the diet. Of
course it is possible that q may be zero. If the ten individuals were
drawn at random (with replacement) from the large group, then the
probability that all ten would be adversely affected is $p^{10}$. Suppose
we consider a few specific values for p. If p = 1/2, then $p^{10} = 1/1024$.
If in fact p is one-half for the large group, then the experimenter has
been most unlucky in his selection, for then a 1 in 1024 chance has
occurred. If we try p = .7, we find $p^{10} \cong .03$, which would still make
the sample rather improbable. We may reasonably suppose that
p > .7. In fact, we may say, “Taking account of sampling fluctu-
ations only, p is greater than .7 unless a chance with probability less
than three in one hundred has occurred in the experiment.”
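
The arithmetic behind these figures is easy to check. The following is only a sketch: the function names are hypothetical, and it assumes, as the text does, ten independent draws (sampling with replacement) from a population in which a proportion p is adversely affected.

```python
# Probability that all n independently drawn individuals are adversely
# affected when the population proportion is p.
def prob_all_affected(p, n=10):
    return p ** n


# Value of p at which p**n equals a chosen error level alpha; any smaller p
# would make the observed sample less probable than alpha.
def lower_bound_for_p(alpha, n=10):
    return alpha ** (1.0 / n)


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.8, 0.9):
        print(f"p = {p:.1f}   p^10 = {prob_all_affected(p):.4f}")
    for alpha in (0.05, 0.01, 0.001):
        print(f"alpha = {alpha}   p > {lower_bound_for_p(alpha):.3f}")
```

Running the script tabulates $p^{10}$ for a few trial values of p and, in the other direction, the bound on p that goes with a given probability of error.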

The last statement is an inductive inference. Somewhat more use- 
ful inferences could be made by taking account of the actual measure- 
ments of, say, the losses in weight, but this simple one will illustrate 
the points we wish to make here. While we say that p > .7, we admit 
the possibility that we may be wrong, and we give a measure, .03, of 
the maximum probability that we may be in error. By increasing the 
maximum probability of error we could narrow the range for p. Thus 
we might say p > .8 unless a chance with probability less than .108
has occurred. The size of the probability of error is a matter of taste 
to a large extent. Some investigators commonly use .05 while others 
wish to be more conservative and use .01 or .001. 

It is to be observed that the probability of error measures only the 
error due to random sampling fluctuations. We have not said any-
thing about the possible accidents that were mentioned earlier. And
in fact it is impossible to say what the probability of such accidents 
may be. The nutritionist can only say something like this: “Barring 
accidents, p > .7 for the group of individuals from which the ten were 
selected, unless a chance with probability less than .03 has occurred 
in the experiment.” 

We may mention one other point here. Referring to the same 
experiment, is it possible to conclude without error that p > 0? The 
answer to this is “Yes” in theory but generally “No” in practice. 
The accidents that may have occurred rule out an inference of this 
kind. An experimenter willingly assumes that he performs his experi- 
ments with such care that the probability of accidents is negligible in 
comparison with the probability of his sampling errors, but he cannot 
assume that accidents are impossible. 

The theory of statistics thus has a part in any inductive inference 
based on experimental data. Its role is to provide a measure, in terms 
of probability, of the uncertainty of the inference. The measure will 
be based entirely on sampling errors. It is up to the experimenter to 
guard against accidents which may invalidate his results, and the 
theory of statistics makes no attempt to deal with this aspect of the 
problem of inference. 

7.2. Populations and Samples. The word population in statistics is 
used to refer to any collection of objects or results of operations. Thus 
we may speak of the population of dairy cattle in Wisconsin, the popu- 
lation of prices of bread in the City of New York, the population of 
mileages of automobile tires, the hypothetical population of heads and 
tails obtained by tossing a coin an infinite number of times, the hypo- 
thetical population of an infinite number of measurements of the 
velocity of light, and so forth. 

The problem of inductive inference is regarded as follows from the 
point of view of statistics: The object of an experiment is to find out 
something about some specified population. It is impossible or 
impracticable to examine the entire population, but one may examine 
a part or sample of it, and on the basis of this limited investigation 
make inferences regarding the whole population. 

It is important that the sample be chosen from the population it is 
desired to study. This obvious principle is violated surprisingly often. 
Thus in the nutrition example mentioned above, if the nutritionist 
wishes to make an inference about the population of the United States, 
his ten subjects must be randomly selected from that population. If, 
in fact, the ten subjects were chosen from among thirty students in
one of his classes in home economics, then he has studied a very limited
population indeed. He can make a rigorous inference only concerning 
the thirty students. Actually, of course, he would probably extend 
his results to cover a larger population with considerable justification. 
He could argue that the mere fact that the ten subjects happened to be 
taking a particular course in home economics could not conceivably
influence the experiment and that the results could certainly be taken 
as representative of all women students in the college. And from 
other experiments he may assume that sex has no effect on reactions 
to diets and claim his results apply to men students as well. He may 
generalize further and say the results reasonably represent all people 
of college age in the region from which the college draws most of its 
students. But here he might be getting on shaky ground, because it 
is well known that college students come from the wealthier and hence 
better nourished families in the region. It is even more doubtful if 
the results could be taken as representative of the whole adult popula- 
tion of the region. And it would be completely unjustifiable to claim 
that the results are valid for the adult population of the whole nation, 
because reactions to a given diet depend on the normal diet, which is 
quite different in different regions. 

Extension of the population originally studied to a larger population 
increases the probability of error by an unknown amount and thus 
destroys the measure of confidence to be placed in the inference. The 
careful investigator does not indulge in this practice, but chooses his 
sample from the entire population he wishes to study if it is at all prac- 
ticable. For example, the nutritionist, if he wishes to make an infer- 
ence about the adult population of the nation, might actually select, 
by some device or other, a random sample of individuals from the 
whole adult population and then enlist the aid of colleagues who happen 
to live near the individuals selected. 

We have implied that a sample must be random. It is this property 
of a sample that enables one to compute the probability of error of his 
inference. The theory of probability cannot be applied to a non- 
random sample, so that there is no way to measure the degree of confi- 
dence to be placed in any inference from such a sample. The word 
random refers to the manner in which the sample is selected rather than 
to the particular sample. Any possible sample is a random sample. 
Thus a person may shuffle a deck of cards thoroughly and then blindly 
draw four cards from it, thus obtaining a random sample of four cards. 
If, in fact, it turned out that he drew the four aces, then he obtained a 
very unrepresentative sample of denominations, but still it was a
random sample by virtue of the method by which it was obtained.
Similarly, the nutritionist may have carefully drawn a random sample 
of ten adults and obtained unfavorable reactions to his diet in all 
cases. It may be, in fact, that only a small proportion of individuals 
in the population would have such a reaction and that the nutritionist 
was particularly unlucky in his sample. The margin of error given in 
his inference measures the probability of such a contingency. 

An investigator hopes, by drawing a random sample, to get a fairly 
representative portion of the population he wishes to study. Often
it is possible to introduce a certain amount of nonrandomness in the 
sampling procedure to obtain partial assurance of a representative 
sample. This can be done when something is known about the popu- 
lation. Thus a public-opinion agency may wish to take a preelection 
poll of the United States. It knows the populations of the various 
states and can assure itself a degree of representativeness by allocating 
its sample to states according to the populations of the states. Thus, 
if 1 per cent of the population is in a given state, 1 per cent of the sample
will be taken in that state. Within the state further allocations may 
be made. The sample may be evenly divided between the sexes. 
The proportions of urban and rural dwellers may be forced to agree 
with the actual known proportions within the state. The effect here 
is to divide the population into a great many smaller populations. 
But somewhere along the line random samples of the subpopulations 
must be taken, if inferences with measurable uncertainty are to be 
made.
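
The allocation just described can be sketched in code. This is a minimal illustration, not a description of any agency's actual procedure: the function names, the largest-remainder rounding, and the example strata are all assumptions. Within each stratum the draw is made at random with replacement, in keeping with the definition of random sampling used in this chapter.

```python
import random


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Largest-remainder rounding (an assumed convention) makes the parts
    add up exactly to sample_size.
    """
    total = sum(stratum_sizes.values())
    exact = {k: sample_size * v / total for k, v in stratum_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(strata, sample_size, rng):
    """Draw a random sample (with replacement) inside every stratum."""
    sizes = {k: len(members) for k, members in strata.items()}
    alloc = proportional_allocation(sizes, sample_size)
    return {k: rng.choices(strata[k], k=alloc[k]) for k in strata}


if __name__ == "__main__":
    # Hypothetical strata holding 50, 30, and 20 per cent of the population.
    strata = {"A": list(range(500)), "B": list(range(300)), "C": list(range(200))}
    print(stratified_sample(strata, 10, random.Random(0)))
```

The nonrandom part is the allocation; the random part is the draw inside each stratum, and it is the latter that keeps the uncertainty of any inference measurable.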

7.3. Sample Distributions. Suppose a variate x has density f(x)
in some population. And suppose a sample of two values of x, say $x_1$
and $x_2$, are drawn at random. The pair of numbers $(x_1, x_2)$ determine
a point in a plane, and the population of all such pairs of numbers that
might have been drawn forms a bivariate population. We are inter-
ested in finding the distribution of this bivariate population in terms
of the original distribution f(x).

The joint density function for $x_1$ and $x_2$ must be some function, say
$f(x_1, x_2)$, such that for any $a_1$, $a_2$, $b_1$, $b_2$ we have

$$P(a_1 < x_1 < b_1,\ a_2 < x_2 < b_2) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} f(x_1, x_2)\, dx_2\, dx_1 \tag{1}$$


Now by a random sample we shall mean that the value of the first
observation $x_1$ has no effect whatever on the value of the second obser-
vation. In other words, for a random sample, $x_1$ and $x_2$ are inde-
pendent in the probability sense. When the two variates of a bivariate
distribution are independent in the probability sense, we have seen
that the joint distribution is the product of the marginal distributions.
In the present instance, the marginal distributions are simply $f(x_1)$
and $f(x_2)$, so that we have, by definition of randomness,

$$f(x_1, x_2) = f(x_1) f(x_2) \tag{2}$$

or, what is the same thing,

$$P(a_1 < x_1 < b_1,\ a_2 < x_2 < b_2) = P(a_1 < x_1 < b_1)\, P(a_2 < x_2 < b_2) \tag{3}$$


As a simple example, suppose x can have only two values, zero and
one, with probabilities q and p, respectively. That is, x is a discrete
variate which has the binomial distribution

$$f(x) = \binom{1}{x} p^x q^{1-x}, \qquad x = 0, 1 \tag{4}$$

and since $\binom{1}{0} = \binom{1}{1} = 1$, we may write it as

$$f(x) = p^x q^{1-x}$$

The joint density for samples of two values of x is

$$f(x_1, x_2) = p^{x_1 + x_2} q^{2 - x_1 - x_2}, \qquad x_1 = 0, 1;\ x_2 = 0, 1 \tag{5}$$


which is defined at the four points (0, 0), (0, 1), (1, 0), (1, 1) in the $x_1$, $x_2$
plane. It is to be observed that this density is not what we should
have obtained by drawing two elements from a binomial population
and counting the number of successes, say y; that density is

$$f(y) = \binom{2}{y} p^y q^{2-y}, \qquad y = 0, 1, 2 \tag{6}$$

and it differs from (5) in that it is the distribution of the single variate
$x_1 + x_2$. Equation (5) gives us the joint distribution of the two
random variates $x_1$ and $x_2$.

It is to be noted that $f(x_1, x_2)$ gives us the distribution of the sample
in the order drawn. Thus in (5), f(0, 1) = pq, not 2pq; f(0, 1) refers
to the probability of drawing first a zero, then a one. And in general,
(1) represents the probability that the first observation drawn falls
in the interval $(a_1, b_1)$ and the second falls in $(a_2, b_2)$. The opposite
occurrence does not satisfy the specification unless, of course, the two
intervals happen to be the same.
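
The distinction between the ordered joint density (5) and the density (6) of the number of successes can be checked numerically. The sketch below uses hypothetical function names and an arbitrary value p = .3; it builds (6) by adding up the ordered outcomes of (5) that give the same total.

```python
from itertools import product


def joint_density(x1, x2, p):
    """Equation (5): probability of the ordered sample (x1, x2)."""
    q = 1 - p
    return p ** (x1 + x2) * q ** (2 - x1 - x2)


def count_density(y, p):
    """Equation (6), obtained by summing (5) over ordered pairs with x1 + x2 = y."""
    return sum(
        joint_density(x1, x2, p)
        for x1, x2 in product((0, 1), repeat=2)
        if x1 + x2 == y
    )


if __name__ == "__main__":
    p = 0.3  # arbitrary illustrative value
    print("f(0, 1) =", joint_density(0, 1, p))    # pq
    print("f(y = 1) =", count_density(1, p))      # 2pq
```

The ordered outcome (0, 1) has probability pq, while the event "exactly one success" collects both (0, 1) and (1, 0) and so has probability 2pq.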

By reasoning exactly as before, we find that the joint density for a
random sample of size n, $x_1, x_2, \cdots, x_n$, from a population with
distribution f(x) is

$$f(x_1, x_2, \cdots, x_n) = f(x_1) f(x_2) \cdots f(x_n) \tag{7}$$

and this again gives the distribution of the sample in the order drawn.

Our definition of random sampling has automatically ruled out 
sampling without replacement from a finite population. If, for exam-
ple, we draw two balls from an urn containing, say, two white and
three black balls, the result of the first draw certainly affects the 
probability of the result of the second. The two drawings are not 
independent in the probability sense. In this case, another definition 
of random sampling must be adopted (Probs. 26 and 32). Our present 
discussion in this and in the following chapters is thus concerned with 
sampling from continuous populations (where the question of drawing 
with or without replacement does not arise) and with sampling with
replacement from finite populations. 
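
The urn example can be worked out exactly by listing the equally likely ordered draws. The sketch below is illustrative only (the names are hypothetical); it compares the conditional probability of a white second ball under the two sampling schemes.

```python
from fractions import Fraction
from itertools import permutations, product

URN = ["W", "W", "B", "B", "B"]  # two white and three black balls


def conditional_second_white(pairs, first_colour):
    """P(second ball white | first ball has first_colour) over equally likely pairs."""
    given = [pair for pair in pairs if pair[0] == first_colour]
    hits = [pair for pair in given if pair[1] == "W"]
    return Fraction(len(hits), len(given))


# Ordered draws of two balls: distinct balls (without replacement)
# versus any two balls (with replacement).
without_repl = [(URN[i], URN[j]) for i, j in permutations(range(5), 2)]
with_repl = [(URN[i], URN[j]) for i, j in product(range(5), repeat=2)]

if __name__ == "__main__":
    for label, pairs in (("without", without_repl), ("with", with_repl)):
        print(label, "replacement:",
              conditional_second_white(pairs, "W"),
              conditional_second_white(pairs, "B"))
```

Without replacement the second-draw probability depends on the first draw (1/4 after a white ball, 1/2 after a black one); with replacement it is 2/5 in either case, which is the independence required by the definition above.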

7.4. Sample Moments. If $x_1, x_2, \cdots, x_n$ are a sample of n values
drawn from a population with density f(x), the rth sample moment is
defined to be

$$m_r' = \frac{1}{n} \sum_{i=1}^{n} x_i^r \tag{1}$$

$m_1'$ is called the sample mean and is more often designated by $\bar{x}$,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2}$$

We shall show that $m_r'$ may be taken to be an estimate of the popula-
tion moment $\mu_r'$.

Suppose g(x) is any function of x; then the expected value of the
function is

$$E[g(x)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx \tag{3}$$


We shall see that for a large sample, $x_1', x_2', \cdots, x_n'$, the expression

$$\frac{1}{n} \sum_{i=1}^{n} g(x_i')$$

may be expected to approximate E[g(x)]. Let the area under f(x) be
divided into strips of width $\Delta x_j$, and let $n_j$ be the number of sample
elements which fall in $\Delta x_j$ with $\sum n_j = n$. Let $x_j$ be the mid-point of
the interval $\Delta x_j$. Then if the $\Delta x_j$ are small, all the $x_i'$ which fall in $\Delta x_j$
will not differ much from $x_j$ and we may write

$$\frac{1}{n} \sum_{i} g(x_i') \cong \frac{1}{n} \sum_{j} n_j\, g(x_j) \tag{4}$$

Now the area over any $\Delta x_j$ is approximately $f(x_j)\Delta x_j$, and it is the prob-
ability, say $p_j$, that any randomly drawn value of x will fall in $\Delta x_j$.
If a sample of n values of x is drawn, we expect $np_j$ of the sample values
to fall in $\Delta x_j$. It follows then that $n_j/n$ is an estimate of $p_j$, and (4)
may be written

$$\frac{1}{n} \sum_{i} g(x_i') \cong \sum_{j} g(x_j)\, p_j \cong \sum_{j} g(x_j) f(x_j)\, \Delta x_j$$

This last sum approximates the integral in (3).

Fig. 34.


The above argument is merely heuristic and does not prove any-
thing. It does give some insight, however, into the way in which
samples provide information about distributions. We can prove
directly that the expected value of $(1/n)\sum g(x_i)$ is E[g(x)]. (We now
drop the primes from the $x_i$; they were used above to distinguish the
sample values from the mid-points of the intervals.) The joint
density of the $x_1, x_2, \cdots, x_n$ is

$$f(x_1, x_2, \cdots, x_n) = \prod_{i=1}^{n} f(x_i) \tag{5}$$

hence the expected value of the sum is

$$E\left[\frac{1}{n} \sum_{i=1}^{n} g(x_i)\right] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{n} \sum_{i=1}^{n} g(x_i) \prod_{j=1}^{n} f(x_j)\, dx_j \tag{6}$$

This integral may be written as the sum of n integrals of the form

$$\frac{1}{n} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_i) \prod_{j=1}^{n} f(x_j)\, dx_j$$

which in turn may be written as the product of n integrals, all but one
of which are of the form

$$\int_{-\infty}^{\infty} f(x_j)\, dx_j = 1$$

and the remaining one is

$$\frac{1}{n} \int_{-\infty}^{\infty} g(x_i) f(x_i)\, dx_i = \frac{1}{n} E[g(x)] \tag{7}$$

Since (6) is the sum of n such integrals, we have

$$E\left[\frac{1}{n} \sum_{i=1}^{n} g(x_i)\right] = E[g(x)] \tag{8}$$

On choosing g(x) to be $x^r$, we find that the expected value of the rth
sample moment is the rth population moment,

$$E(m_r') = E\left(\frac{1}{n} \sum_{i=1}^{n} x_i^r\right) = E(x^r) = \mu_r' \tag{9}$$


We may review the meaning of this result. The sample moment $m_r'$
is a function of n random variables and is therefore a random variable
itself. As such, it has some probability distribution, and equation
(9) shows that the mean value of that distribution is $\mu_r'$. We do not
therefore suppose that $m_r'$ is in any sense equal to $\mu_r'$ for a given sample;
it is simply a random variable whose mean is $\mu_r'$. We shall speak of
$m_r'$ as being an estimate of $\mu_r'$. Whether or not it will be an accurate
estimate depends on how closely the distribution of $m_r'$ is concentrated
about its mean.

Corresponding to the population moments $\mu_r$ about the mean, we
may define sample moments about the sample mean as follows:

$$m_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r \tag{10}$$

We have $m_1 = 0$ just as $\mu_1 = 0$.
The $m_r$ may be regarded as estimates of the $\mu_r$ in the same sense that
the $m_r'$ estimate the $\mu_r'$; however, they are biased estimates. That is, it is not
true that

$$E(m_r) = \mu_r$$

except when r = 1. We shall illustrate this fact for r = 2 in Probs. 12
and 13.
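
Both statements, that the sample moments about the origin are unbiased and that the sample moments about the sample mean are not, can be verified exactly for a small discrete population by averaging over every possible ordered sample. The sketch below assumes, purely for illustration, that the population is one roll of a fair die; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)  # assumed population: one roll of a fair die


def raw_moment(sample, r):
    """Sample moment about the origin: (1/n) * sum of x_i**r."""
    return Fraction(sum(x ** r for x in sample), len(sample))


def central_moment(sample, r):
    """Sample moment about the sample mean: (1/n) * sum of (x_i - xbar)**r."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** r for x in sample) / len(sample)


def expected_value(statistic, n):
    """Average a statistic over all 6**n equally likely ordered samples."""
    samples = list(product(FACES, repeat=n))
    return sum(statistic(s) for s in samples) / Fraction(len(samples))


if __name__ == "__main__":
    n = 3
    pop_raw2 = Fraction(sum(x ** 2 for x in FACES), 6)    # second moment about the origin
    pop_var = pop_raw2 - Fraction(sum(FACES), 6) ** 2     # second moment about the mean
    print("E(m'_2) =", expected_value(lambda s: raw_moment(s, 2), n), " population:", pop_raw2)
    print("E(m_2)  =", expected_value(lambda s: central_moment(s, 2), n), " population:", pop_var)
```

For this population and n = 3 the expected value of the second moment about the origin agrees with the population value 91/6, while the expected value of the second moment about the sample mean is 35/18, which is (n - 1)/n times the population variance 35/12.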

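7.5. The Law of Large Numbers. We have seen that the expected
value of a sample mean is the population mean.

$$E(\bar{x}) = \mu \tag{1}$$

Let us find the variance of the sample mean $\bar{x}$.

$$\sigma_{\bar{x}}^2 = E\left[(\bar{x} - \mu)^2\right] = E\left[\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)\right]^2 = \frac{1}{n^2} E\left[\sum_{i=1}^{n} (x_i - \mu)\right]^2 \tag{2}$$

On squaring the sum, we get n terms of the form $(x_i - \mu)^2$ and $\binom{n}{2}$
terms of the form $2(x_i - \mu)(x_j - \mu)$ with $i \ne j$. The expected value
of $(x_i - \mu)^2$ depends only on the marginal distribution of $x_i$, since in
the integral

$$\int \cdots \int (x_i - \mu)^2 \prod_{j=1}^{n} [f(x_j)\, dx_j]$$

all factors not involving $x_i$ become one and we are left with

$$\int (x_i - \mu)^2 f(x_i)\, dx_i = \sigma^2 \tag{3}$$

where $\sigma^2$ is the population variance. Similarly,

$$E[(x_i - \mu)(x_j - \mu)] = \int \int (x_i - \mu)(x_j - \mu) f(x_i) f(x_j)\, dx_i\, dx_j = \int (x_i - \mu) f(x_i)\, dx_i \int (x_j - \mu) f(x_j)\, dx_j = 0 \tag{4}$$

Equation (2) then becomes

$$\sigma_{\bar{x}}^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \tag{5}$$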


Thus the variance of the sample mean is equal to the population vari- 
ance divided by the sample size; this is true for any population with a 
finite variance. 

This fact is of extreme importance in applied statistics. It implies 
that whatever the population distribution (provided it has a finite 
variance), the distribution of the sample mean becomes more and more 
concentrated near the population mean as the sample size increases. 
It follows that the larger the sample, the more certain we can be that 
the sample mean will be a good estimate of the population mean. This
is essentially the law of large numbers. We shall obtain a more pre- 
cise statement of it below. 
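
The result that the variance of the sample mean equals the population variance divided by n holds for any population with finite variance, and it can be confirmed exactly for a small discrete one. The sketch below uses a hypothetical skewed three-point population and enumerates every ordered sample with its probability; none of the names come from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical discrete population: value -> probability.
POPULATION = {0: Fraction(1, 2), 1: Fraction(1, 3), 5: Fraction(1, 6)}


def population_mean_var():
    """Mean and variance of the population itself."""
    mean = sum(x * p for x, p in POPULATION.items())
    var = sum((x - mean) ** 2 * p for x, p in POPULATION.items())
    return mean, var


def sample_mean_var(n):
    """Exact mean and variance of the sample mean over all ordered samples of size n."""
    outcomes = []
    for sample in product(POPULATION.items(), repeat=n):
        prob = Fraction(1)
        total = 0
        for x, p in sample:
            prob *= p      # independence: the joint probability is a product
            total += x
        outcomes.append((Fraction(total, n), prob))
    mean = sum(xbar * prob for xbar, prob in outcomes)
    var = sum((xbar - mean) ** 2 * prob for xbar, prob in outcomes)
    return mean, var


if __name__ == "__main__":
    mu, sigma2 = population_mean_var()
    for n in (1, 2, 3, 4):
        mean, var = sample_mean_var(n)
        print(n, mean == mu, var, sigma2 / n)
```

Exact fractions are used so that the agreement between the enumerated variance and $\sigma^2/n$ is an identity rather than a numerical approximation.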

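Suppose the density of the sample mean is $g(\bar{x})$, where $\bar{x}$ is the mean
of a sample of size n from a population with density f(x). We have
found that the mean and variance of $g(\bar{x})$ are $\mu$ and $\sigma^2/n$, where $\mu$ and
$\sigma^2$ are the mean and variance of f(x). It follows from the definition
of the variance that

$$\frac{\sigma^2}{n} = \int_{-\infty}^{\infty} (\bar{x} - \mu)^2 g(\bar{x})\, d\bar{x} \tag{6}$$

Fig. 35. The density $g(\bar{x})$, with the points $\mu - a\sigma/\sqrt{n}$, $\mu$, and $\mu + a\sigma/\sqrt{n}$ marked on the $\bar{x}$ axis.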


Now let us break up the range of integration into three parts, as illus-
trated in Fig. 35:

$$\frac{\sigma^2}{n} = \int_{-\infty}^{\mu - (a\sigma/\sqrt{n})} (\bar{x} - \mu)^2 g(\bar{x})\, d\bar{x} + \int_{\mu - (a\sigma/\sqrt{n})}^{\mu + (a\sigma/\sqrt{n})} (\bar{x} - \mu)^2 g(\bar{x})\, d\bar{x} + \int_{\mu + (a\sigma/\sqrt{n})}^{\infty} (\bar{x} - \mu)^2 g(\bar{x})\, d\bar{x} \tag{7}$$

where a is any arbitrarily chosen positive number. We are going to
obtain an inequality by reducing the right-hand side of equation (7).
We shall discard the second integral, and since it is positive, the right-
hand side will be decreased. Also in the first integral we shall replace
the factor $(\bar{x} - \mu)^2$ by $a^2\sigma^2/n$. This will clearly reduce the value of
the integral, since in the range of integration

$$|\bar{x} - \mu| \ge \frac{a\sigma}{\sqrt{n}}$$

The same substitution will reduce the third integral also. We shall
have then

$$\frac{\sigma^2}{n} > \frac{a^2\sigma^2}{n} \int_{-\infty}^{\mu - (a\sigma/\sqrt{n})} g(\bar{x})\, d\bar{x} + \frac{a^2\sigma^2}{n} \int_{\mu + (a\sigma/\sqrt{n})}^{\infty} g(\bar{x})\, d\bar{x} \tag{8}$$

or, what is the same thing,

$$\frac{1}{a^2} > P\left(|\bar{x} - \mu| > \frac{a\sigma}{\sqrt{n}}\right) \tag{9}$$

since the two integrals in (8) give exactly the probability that $\bar{x}$ lies
outside the interval $\mu - (a\sigma/\sqrt{n})$ to $\mu + (a\sigma/\sqrt{n})$.
Now in (9) let $a\sigma/\sqrt{n} = b$; then $1/a^2 = \sigma^2/nb^2$, and (9) becomes

$$P(|\bar{x} - \mu| > b) < \frac{\sigma^2}{nb^2} \tag{10}$$

This relation is known as Tchebysheff’s inequality. It may be
written in the alternative form

$$P(-b \le \bar{x} - \mu \le b) > 1 - \frac{\sigma^2}{nb^2} \tag{11}$$


Tchebysheff’s inequality gives a precise formulation of the law of 
large numbers. Referring to (11), we may choose any small number b 
and determine a small interval about the population mean; having 
done this, we may choose n large enough to give a value as near one as 
we please for the probability that the sample mean will lie within the 
small interval containing the population mean. 

To consider an example, suppose some distribution with an unknown 
mean has a variance equal to one. How large a sample must be taken 
in order that the probability will be at least .95 that the sample mean 
will lie within .5 of the population mean? We have $\sigma^2 = 1$, b = .5,
and we wish to choose n so that $1 - \sigma^2/nb^2$ will be .95.

$$1 - \frac{1}{n(.5)^2} = .95$$

whence

$$n = \frac{\sigma^2}{.05\, b^2} = \frac{1}{.05(.5)^2} = 80$$

The example is not realistic because the variance is assumed to be
known. Later we shall have to consider ways of circumventing this
difficulty. The important thing here is the indication of the possi- 
bility of making very accurate and reliable inferences provided large 
samples can be obtained. 
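
The sample-size calculation in the example generalizes directly from inequality (11). The following is a minimal sketch with a hypothetical function name; it assumes the population variance is known, which, as just noted, is rarely the case in practice.

```python
import math


def chebyshev_sample_size(variance, b, confidence):
    """Smallest n for which 1 - variance / (n * b**2) is at least `confidence`.

    Solves 1 - sigma^2 / (n b^2) = confidence for n and rounds up.
    """
    n = variance / ((1.0 - confidence) * b ** 2)
    return math.ceil(n - 1e-9)  # small tolerance guards against floating-point round-off


if __name__ == "__main__":
    print(chebyshev_sample_size(1.0, 0.5, 0.95))   # the example in the text
    print(chebyshev_sample_size(1.0, 0.5, 0.99))
    print(chebyshev_sample_size(1.0, 0.1, 0.95))
```

Because the bound depends on n only through $\sigma^2/nb^2$, halving the interval half-width b quadruples the required sample size, and so does cutting the permitted probability of error to a quarter.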

7.6. The Central-limit Theorem. The central-limit theorem gives 
a still more precise statement of the law of large numbers. It is the 
most important theorem in statistics from both the theoretical and
applied points of view. And it is one of the most remarkable theorems
in the whole of mathematics. A great many eminent mathematicians
(De Moivre, Laplace, Gauss, Tchebysheff, Liapounoff, Levy, Cramer,
and others) have contributed to its development. The theorem is this:

If a population has a finite variance $\sigma^2$ and mean $\mu$, then the distribu-
tion of the sample mean approaches the normal distribution with variance
$\sigma^2/n$ and mean $\mu$ as the sample size n increases.

The astonishing thing about the theorem is the fact that nothing is 
said about the form of the population distribution function. What- 
ever the distribution function, provided only that it have a finite 
variance, the sample mean will have approximately the normal distri- 
bution for large samples. The condition that the variance be finite is 
not a critical restriction so far as applied statistics is concerned, 
because in almost any practical situation the range of the variate will 
be finite, in which case the variance must necessarily be finite. 

We shall not be able to prove this theorem, because it requires rather 
advanced mathematical techniques. However, in order to make 
the theorem plausible, we shall consider an argument for the more 
restricted situation in which the distribution has a moment generat- 
ing function. The argument will be essentially a matter of showing 
that the moment generating function for the sample mean approaches 
the moment generating function for the normal distribution. We 
shall first obtain the moment generating function for 

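
$$y = \frac{x' - \mu'}{\sigma'}$$

when $x'$ is normally distributed. The generating function is

$$m_1(t) = \int_{-\infty}^{\infty} e^{t(x' - \mu')/\sigma'}\, n(x'; \mu', \sigma'^2)\, dx' \tag{1}$$

$$= \int_{-\infty}^{\infty} \frac{1}{\sigma'\sqrt{2\pi}}\, e^{t(x' - \mu')/\sigma'}\, e^{-\frac{1}{2}\left(\frac{x' - \mu'}{\sigma'}\right)^2} dx' \tag{2}$$

and as in section 6.2 we find

$$m_1(t) = e^{t^2/2} \tag{3}$$
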


Now suppose x has some arbitrary density function f(x) with mean
$\mu$ and variance $\sigma^2$ which has a moment generating function. The
moment generating function of $(x - \mu)/\sigma$, say $m_2(t)$, is defined as

$$m_2(t) = \int_{-\infty}^{\infty} e^{t(x - \mu)/\sigma} f(x)\, dx \tag{4}$$

A sample of size n will have a mean $\bar{x}$ with some distribution, say $g(\bar{x})$,
which we have seen must have mean $\mu$ and variance $\sigma^2/n$. The
moment generating function for

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \tag{5}$$

say $m_3(t)$, is defined as

$$m_3(t) = \int_{-\infty}^{\infty} e^{t(\bar{x} - \mu)/(\sigma/\sqrt{n})} g(\bar{x})\, d\bar{x} \tag{6}$$

It is our purpose to show that $m_3(t)$ must approach $m_1(t)$ when n, the
sample size, becomes large.

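We can determine $m_3(t)$ in terms of $m_2(t)$. $m_3(t)$ is the expected
value,

$$E(e^{tz}) = E\left(e^{t(\bar{x} - \mu)/(\sigma/\sqrt{n})}\right)$$

and since we know that the joint distribution of the $x_1, x_2, \cdots, x_n$ is
$\prod f(x_i)$, we may write

$$m_3(t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{\frac{t}{\sqrt{n}} \sum_i \frac{x_i - \mu}{\sigma}} \prod_{i=1}^{n} f(x_i)\, dx_i$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i=1}^{n} e^{\frac{t}{\sqrt{n}} \frac{x_i - \mu}{\sigma}} f(x_i)\, dx_i \tag{7}$$

and by virtue of (4), each factor in this product is simply $m_2(t/\sqrt{n})$;
hence

$$m_3(t) = \left[m_2\left(\frac{t}{\sqrt{n}}\right)\right]^n \tag{8}$$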


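The rth derivative of $m_2(t/\sqrt{n})$ evaluated at t = 0 obviously gives us
the rth moment about the mean divided by $(\sigma\sqrt{n})^r$. And we have
seen in Sec. 5.3 that we may write

$$m_2\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{\mu_1}{\sigma\sqrt{n}}\, t + \frac{\mu_2}{2!\,\sigma^2 n}\, t^2 + \frac{\mu_3}{3!\,\sigma^3 n\sqrt{n}}\, t^3 + \cdots \tag{9}$$

and since $\mu_1 = 0$, $\mu_2 = \sigma^2$, this may be written

$$m_2\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{1}{n}\left(\frac{t^2}{2!} + \frac{\mu_3\, t^3}{3!\,\sigma^3\sqrt{n}} + \cdots\right) \tag{10}$$

If we recall that the definition of $e^u$ is

$$e^u = \lim_{n \to \infty} \left(1 + \frac{u}{n}\right)^n$$

we see that $m_3(t)$, as n becomes infinite, becomes of exactly this form,
where u represents the parenthesis in (10), and when n becomes
infinite, all terms in u vanish except the first, so we have

$$\lim_{n \to \infty} m_3(t) = e^{t^2/2} \tag{11}$$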


Hence in the limit z has the same moment generating function as y
and, by virtue of the statement at the end of Sec. 5.4, has the same
distribution. Thus in the limit the sample mean must have the normal
distribution whatever the distribution f(x), provided that f(x) has a
moment generating function, or more generally, provided that f(x)
has a second moment. And for large n, we may say that the sample
mean is approximately normally distributed.

Fig. 36.
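
The convergence of the moment generating functions can be watched numerically in a case where $m_2(t)$ has a closed form. The sketch below assumes the exponential density $f(x) = e^{-x}$, x > 0, for which $\mu = \sigma = 1$ and the moment generating function of $(x - \mu)/\sigma$ is $e^{-s}/(1 - s)$ for s < 1; the function names are hypothetical.

```python
import math


def m3_exponential(t, n):
    """[m_2(t / sqrt(n))]**n for the density exp(-x), x > 0.

    With mu = sigma = 1, m_2(s) = E[exp(s (x - 1))] = exp(-s) / (1 - s), s < 1.
    The nth power is evaluated through logarithms to avoid overflow for large n.
    """
    s = t / math.sqrt(n)
    if s >= 1:
        raise ValueError("need t < sqrt(n) for the generating function to exist")
    return math.exp(n * (-s - math.log1p(-s)))


def normal_limit(t):
    """Moment generating function of the standard normal variate: exp(t**2 / 2)."""
    return math.exp(t * t / 2)


if __name__ == "__main__":
    t = 1.0
    for n in (10, 100, 1000, 10000):
        print(n, m3_exponential(t, n), normal_limit(t))
```

For fixed t the printed values approach $e^{t^2/2}$ as n grows, though slowly; the leading discrepancy comes from the third-moment term in the series, which shrinks only like $1/\sqrt{n}$.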

The degree of approximation depends, of course, on the sample size
and on the particular density function f(x). The approach to normal-
ity is illustrated in Fig. 36 for the particular function $f(x) = e^{-x}$,
x > 0. The solid curves give the actual distributions, while the
dashed curves give the normal approximations. (a) gives the original
distribution which corresponds to samples of one; (b) shows the dis-
tribution of sample means for n = 3; (c) gives the distribution of
sample means for n = 10. The curves rather exaggerate the approach
to normality because they cannot show what happens on the tails of
the distribution. Ordinarily distributions of sample means approach
normality fairly rapidly with the sample size in the region of the mean,
but more slowly at points distant from the mean; usually the greater
the distance of a point from the mean, the more slowly the normal
approximation approaches the actual distribution.
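
For the density $f(x) = e^{-x}$ the comparison can be made without simulation, because the sum of n independent observations has a gamma (Erlang) distribution whose distribution function is known in closed form. The sketch below is an illustration under that assumption, with hypothetical function names; it compares the exact probability that the sample mean does not exceed c with its normal approximation.

```python
import math
from statistics import NormalDist


def exact_cdf_of_mean(c, n):
    """P(sample mean <= c) for n draws from the density exp(-x), x > 0.

    The sum of the n draws is Erlang distributed, with distribution function
    1 - exp(-s) * sum_{k < n} s**k / k!  evaluated at s = n * c.
    """
    s = n * c
    return 1.0 - math.exp(-s) * sum(s ** k / math.factorial(k) for k in range(n))


def normal_cdf_of_mean(c, n):
    """Central-limit approximation: normal with mean 1 and variance 1/n."""
    return NormalDist(mu=1.0, sigma=1.0 / math.sqrt(n)).cdf(c)


if __name__ == "__main__":
    for n in (1, 3, 10):
        for c in (1.0, 2.0):   # at the mean, and out on the upper tail
            exact = exact_cdf_of_mean(c, n)
            approx = normal_cdf_of_mean(c, n)
            print(f"n = {n:2d}  c = {c}  exact = {exact:.4f}  normal = {approx:.4f}")
```

At the population mean the discrepancy shrinks steadily as n goes from 1 to 3 to 10, which is the behavior the figure is meant to convey.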

The central-limit theorem applies to discrete as well as to continuous 
distributions. The moment generating functions used in this section 
could have been moment generating functions for discrete distribu- 
tions, and the argument would have been just the same except that 
the integrals would have been replaced by sums. We shall investigate 
the nature of this approximation in the next section for a particular 
discrete distribution. 

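7.7. Normal Approximation to the Binomial Distribution. We
shall consider the density

$$f(x) = p^x q^{1-x}, \qquad x = 0, 1 \tag{1}$$

which has

$$\mu = p, \qquad \sigma^2 = pq \tag{2}$$

and suppose a sample, $x_1, x_2, \cdots, x_n$, of size n is drawn. The
sample will simply be a sequence of zeros and ones in this instance, one
denoting a success, say, and zero a failure. And

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

is the proportion of successes in the sample. We have seen that the
mean and variance of $\bar{x}$ are

$$\mu_{\bar{x}} = p \tag{3}$$

$$\sigma_{\bar{x}}^2 = \frac{pq}{n} \tag{4}$$

The distribution of $\bar{x}$ is discrete; in fact $\bar{x}$ can take on only the values

$$0,\ \frac{1}{n},\ \frac{2}{n},\ \cdots,\ \frac{j}{n},\ \cdots,\ 1$$

and we know that the density of j, the number of successes, is

$$f(j) = \binom{n}{j} p^j q^{n-j}, \qquad j = 0, 1, 2, \cdots, n \tag{5}$$

Thus since $j = n\bar{x}$, the density of $\bar{x}$ is

$$h(\bar{x}) = \binom{n}{n\bar{x}} p^{n\bar{x}} q^{n(1 - \bar{x})}, \qquad \bar{x} = 0, \frac{1}{n}, \frac{2}{n}, \cdots, 1 \tag{6}$$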


The way in which this discrete density is approximated by a con-
tinuous density function is illustrated in Fig. 37.

Fig. 37.

Suppose we construct rectangles of heights $h(\bar{x})$ and widths 1/n with
mid-points of the bases at j/n, j = 0, 1, 2, ..., n. The tops of
these rectangles form a broken curve which we may represent by $g(\bar{x})$.
Since $\sum h(\bar{x}) = 1$, the area under $g(\bar{x})$ will be 1/n. It is clear that

$$P\left(\frac{a}{n} \le \bar{x} \le \frac{b}{n}\right) = \int_{(a - \frac{1}{2})/n}^{(b + \frac{1}{2})/n} n\, g(\bar{x})\, d\bar{x} \tag{7}$$

for any integers a and b (b > a) in the range of j, since the integral is
simply the area under the tops of the rectangles over the points a to b
and is therefore

$$\sum_{\bar{x} = a/n}^{b/n} h(\bar{x}) = \sum_{j=a}^{b} \binom{n}{j} p^j q^{n-j} \tag{8}$$


As n becomes large, the width of the rectangles decreases and the
steps in the function $n g(\bar{x})$ become closer together so that it has the
appearance, say, of the function in Fig. 38. The normal approxima-
tion to the binomial distribution may be regarded as the limiting form
of this broken curve as n becomes infinite.

Fig. 38. The function $n g(\bar{x})$ plotted against $\bar{x}$ over the range 0 to 1.0.

This normal approximation is of particular interest because it pro-
vides a method of computing easily the approximate value of sums of
the binomial distribution. As an illustration, let us suppose a true
die is cast and a one or a two counted as a success. Then p = 1/3,
q = 2/3. For a sample of 300 trials, the total number of successes, j,
has the density

$$f(j) = \binom{300}{j} \left(\frac{1}{3}\right)^j \left(\frac{2}{3}\right)^{300-j}, \qquad j = 0, 1, \cdots, 300$$

Suppose we wanted the probability that the number of successes will
not deviate from 100 by more than 15; we should have to sum f(j)
over the range 85 to 115, a very tedious calculation. We can approxi-
mate the sum by using the fact that

$$P(85 \le j \le 115) = P\left(\frac{85}{300} \le \bar{x} \le \frac{115}{300}\right)$$

and since $\bar{x} = j/300$ is approximately normally distributed with mean
1/3 and variance 1/3 × 2/3 × 1/300, we have

$$P\left(\frac{85}{300} \le \bar{x} \le \frac{115}{300}\right) \cong \int_{85/300}^{115/300} n\left(\bar{x}; \tfrac{1}{3}, \tfrac{2}{2700}\right) d\bar{x} = \int_{85/300}^{115/300} \frac{1}{\sqrt{2\pi}\sqrt{2/2700}}\, e^{-\frac{1}{2} \frac{(\bar{x} - \frac{1}{3})^2}{2/2700}}\, d\bar{x}$$

and letting $t = (\bar{x} - \frac{1}{3})/\sqrt{2/2700}$, we have

$$P\left(\frac{85}{300} \le \bar{x} \le \frac{115}{300}\right) \cong \int_{-1.84}^{1.84} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

since

$$\frac{\frac{85}{300} - \frac{1}{3}}{\sqrt{2/2700}} \cong -1.84, \qquad \frac{\frac{115}{300} - \frac{1}{3}}{\sqrt{2/2700}} \cong 1.84$$

Using tables of the normal distribution, we find

$$P(85 \le j \le 115) \cong .934 \tag{9}$$

The approximation could be slightly improved by using 85 - 1/2 and
115 + 1/2 in computing limits on the integral as indicated by (7).
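In general, for the binomial distribution, it is now evident that

$$P(a \le j \le b) = \sum_{j=a}^{b} \binom{n}{j} p^j q^{n-j} \tag{10}$$

$$\cong \int_{\alpha}^{\beta} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \tag{11}$$

where

$$\alpha = \frac{[(a - \frac{1}{2})/n] - p}{\sqrt{pq/n}}, \qquad \beta = \frac{[(b + \frac{1}{2})/n] - p}{\sqrt{pq/n}} \tag{12}$$
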

A more detailed investigation would show that the error in this
approximation is less than

$$\frac{.15}{\sqrt{npq}} \tag{13}$$

provided npq > 25. Thus in the above example our maximum error
is measured by

$$\frac{.15}{\sqrt{300 \times \frac{1}{3} \times \frac{2}{3}}} \cong .018$$

so that the approximation (9) does not quite have two-place accuracy
in so far as we can judge by (13). More accurate approximations are
provided by Uspensky (“Introduction to Mathematical Probability,”
Chap. VII, McGraw-Hill Book Company, Inc., New York, 1937).
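
With a computer the "very tedious calculation" of the exact binomial sum is no longer an obstacle, so the approximation can be compared with it directly. The sketch below is illustrative only; the function names are hypothetical, and the normal integral is evaluated with the standard library rather than from tables.

```python
import math
from statistics import NormalDist


def binomial_sum(n, p, a, b):
    """Exact P(a <= j <= b) for the binomial density, as in equation (10)."""
    q = 1.0 - p
    return sum(math.comb(n, j) * p ** j * q ** (n - j) for j in range(a, b + 1))


def normal_approx(n, p, a, b, correction=True):
    """Normal approximation (11); with correction=True the limits are those of (12)."""
    half = 0.5 if correction else 0.0
    sd = math.sqrt(p * (1.0 - p) / n)
    lo = ((a - half) / n - p) / sd
    hi = ((b + half) / n - p) / sd
    z = NormalDist()
    return z.cdf(hi) - z.cdf(lo)


if __name__ == "__main__":
    n, p, a, b = 300, 1.0 / 3.0, 85, 115
    print("exact sum              ", round(binomial_sum(n, p, a, b), 4))
    print("normal, limits a and b ", round(normal_approx(n, p, a, b, correction=False), 4))
    print("normal, limits of (12) ", round(normal_approx(n, p, a, b), 4))
    print("error bound (13)       ", round(0.15 / math.sqrt(n * p * (1 - p)), 4))
```

For the die example, the uncorrected limits reproduce the value in (9), and the half-unit adjustment of (12) brings the approximation closer to the exact sum; both errors lie within the bound (13).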

7.8. Role of the Normal Distribution in Statistics. It will be found 
in the ensuing chapters that the normal distribution plays a very 
predominant part. Of course, the central-limit theorem alone ensures 
that this will be the case, but there are other almost equally important 
reasons. 

In the first place, many populations encountered in the course of 
research in many fields seem to have a normal distribution to a good 
degree of approximation. It has often been argued that this phe- 
nomenon is quite reasonable in view of the central-limit theorem. We
may consider the firing of a shot at a target as an illustration. The
course of the projectile is affected by a great many factors all admit- 
tedly with small effect. The net deviation is the net effect of all these 
factors. Suppose the effect of each factor is an observation from some 
population; then the total effect is essentially the mean of a set of 
observations from a set of populations. Being of the nature of means, 
the actually observed deviations might therefore be expected to be 
approximately normally distributed. We do not intend to imply here 
that most distributions encountered in practice are normal, for such 
is not the case at all, but nearly normal distributions are encountered 
quite frequently. 

Another consideration which favors the normal distribution is the 
fact that sampling distributions based on a parent normal distribution 
are fairly manageable analytically. In making inferences about popu- 
lations from samples it is necessary to have the distributions for 
various functions of the sample observations. The mathematical 
problem of obtaining these distributions is often easier for samples from 
a normal population than from any other. 

Because all these auxiliary distributions are required in statistical 
inference, the economical thing to do is obtain them for one kind of 
population distribution only. When another kind of population is 
under examination, the observations may be transformed so that they 
follow the distribution first chosen. The normal distribution is the 
logical candidate for this choice. Thus if a complete theory of sta- 
tistical inference is developed based on the normal distribution alone, 
then one has in reality a system which may be employed quite generally,
because other distributions can be transformed to the normal
form.

In applying statistical methods based on the normal distribution, 
the experimenter must know, at least approximately, the general form 
of the distribution function which his data follow. If it is normal, he 
may use the methods directly; if it is not, he may transform his data 
so that the transformed observations follow a normal distribution.
When the experimenter does not know the form of his population 
distribution, then he must use other more general but usually less 
powerful methods of analysis called distribution-free methods. Some 
of these methods will be presented in the final chapter of the book. 
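
As an illustration of transforming data toward normality, here is a small sketch. It assumes a hypothetical population whose logarithms are normally distributed, so that taking logarithms is the appropriate transformation; the population, the sample size, and the use of skewness as a rough check of symmetry are all assumptions made for the example.

```python
import math
import random
import statistics

def skewness(data):
    """Third central moment divided by the cube of the standard deviation."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return statistics.fmean(((x - m) / s) ** 3 for x in data)

rng = random.Random(7)
# Hypothetical skewed observations: their logarithms are normal with mean 0, s.d. 1.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
transformed = [math.log(x) for x in raw]

skew_raw = skewness(raw)
skew_transformed = skewness(transformed)
print(skew_raw, skew_transformed)
```

The raw observations are strongly skewed to the right, while the transformed ones should be nearly symmetric, as a normal sample would be. In practice the right transformation has to be chosen from knowledge of the form of the distribution, as the text says.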


7.9. Problems 
1. In the joint distribution $p^{x_1+x_2} q^{2-x_1-x_2}$ for a sample of two from a
binomial population, let $x_1 = y - x_2$ and find the joint distribution
of $y$ and $x_2$.


2. Find the marginal distribution of y from the results of the above 
problem. 

3. What is the probability that the two observations of a sample 
of two from a population with a rectangular distribution over the unit 
interval will not differ by more than one-half? 

4. What is the probability that the mean of a sample of two observations
from a rectangular distribution (over the unit interval) will
be between 1/4 and 3/4?

5. What is the probability that the larger of two random observa- 
tions from any continuous distribution will exceed the median? 

6. If $x_1$ and $x_2$ are a sample of two from a population with density
$f(x)$, and if the smaller of these values is denoted by $y_1$ and the larger
by $y_2$, what is the joint density of $y_1$ and $y_2$?

7. Generalize the result of Prob. 6 to samples of size $n$, letting $y_1$
be the smallest and $y_n$ the largest of the $n$ observations.

8. What is the marginal density of the smallest observation for 
samples of size n? 

9. Considering random samples of size n from a population with 
density f(x), what is the expected value of the area under f(x) to the 
left of the smallest sample observation? 

10. Balls are drawn with replacement from an urn containing one
white and two black balls. Let $x = 0$ for a white ball and $x = 1$ for
a black ball. For samples $x_1, x_2, \ldots, x_9$ of size nine, what is the
joint distribution of the observations? The distribution of the sum
of the observations?

11. Referring to Prob. 10, find the expected values of the sample 
mean and sample variance. 

12. For samples of size two from a population with variance $\sigma^2$,
show that the expected value of the sample variance is $\sigma^2/2$.

13. Generalize the result of Prob. 12 to samples of size $n$.

14. What value of $y$ minimizes $\sum_{i=1}^{n} (x_i - y)^2$?

15. If $\bar{x} = (1/n) \sum_{i=1}^{n} x_i$, show that

$$\sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$$


Using this result and that of Prob. 14, explain why the sample variance 
gives a biased estimate of the population variance. 
16. Find $E(m_3)$ for samples of size two from a population with a
finite third moment. 


17. Show that $E[(1/n) \sum (x_i - \mu)^r] = \mu_r$ for samples of size $n$ from a
population with mean $\mu$ and $r$th moment $\mu_r$.

18. Use Tchebysheff's inequality to find how many times a coin
must be tossed in order that the probability will be at least .90 that $\bar{x}$
will lie between .4 and .6. (Assume the coin is true.)

19. How could one determine the number of tosses required in
Prob. 18 more accurately, i.e., make the probability very nearly equal
to .90? What is the number of tosses?

20. If a population has $\sigma = 2$ and $\bar{x}$ is the mean of samples of size
100, find limits between which $\bar{x} - \mu$ will lie with probability .90. Use
both Tchebysheff's inequality and the central-limit theorem. Why
do the two results differ?

21. Suppose $\bar{x}_1$ and $\bar{x}_2$ are means of two samples of size $n$ from a
population with variance $\sigma^2$. Determine $n$ so that the probability
will be about .01 that the two sample means will differ by more than $\sigma$.
(Consider the variate $y = \bar{x}_1 - \bar{x}_2$.)

22. Suppose light bulbs made by a standard process have an average 
life of 2000 hours with a standard deviation of 250 hours. And sup- 
pose it is considered worth while to replace the process if the mean life 
can be increased by at least 10 per cent. An engineer wishes to test a 
proposed new process, and he is willing to assume that the standard 
deviation of the distribution of lives is about the same as for the 
standard process. How large a sample should he examine if he wishes 
the probability to be about .01 that he will fail to adopt the new process 
if in fact it produces bulbs with a mean life of 2250 hours? 

23. A research worker wishes to estimate the mean of a population 
using a sample large enough that the probability will be .95 that the 
sample mean will not differ from the population mean by more than 
25 per cent of the standard deviation. How large a sample should he 
take? 

24. A polling agency wishes to take a sample of voters in a given 
state large enough that the probability is only .01 that they will find 
the proportion favoring a certain candidate to be less than 50 per cent 
when in fact it is 52 per cent. How large a sample should be taken? 

25. A standard drug is known to be effective in about 80 per cent of 
cases in which it is used to treat infections. A new drug has been 
found effective in 85 of the first 100 cases tried. Is the superiority 
of the new drug well established? (If the new drug were equally 
effective as the old, what would be the probability of obtaining 85 or 
more successes in a sample of 100?) 

26. A bowl contains five chips numbered from one to five. A sample
of two drawn without replacement from this finite population is said
to be random if all possible pairs of the five chips have an equal chance
to be drawn. What is the expected value of the sample mean? What
is the variance of the sample mean?

27. Suppose the two chips of Prob. 26 were drawn with replacement, 
what would be the variance of the sample mean? Why might one 
guess that this variance would be larger than the one obtained before? 

28. If a density $f(x)$ has a moment generating function $m(t)$, show
that the mean of samples of size $n$ has the moment generating function
$[m(t/n)]^n$.

29. Use the result of Prob. 28 to show that the mean and variance
of the sample mean are $\mu$ and $\sigma^2/n$.

30. Find the third moment about the mean of the sample mean for
samples of size $n$ from a binomial population. Show that it approaches
zero as $n$ becomes large (as it must if the normal approximation is to be
valid).

31. Suppose the life of a certain part of a machine is distributed by
Ole" where t is measured in days. The machine comes supplied
with one spare. What is the density of the combined life of the part 
and its spare? 

32. Generalize Prob. 26, considering $N$ chips and samples of size $n$.
The variance of the sample mean is

$$\frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$$

where $\sigma^2$ is the population variance.


CHAPTER 8 
POINT ESTIMATION 


8.1. Estimation of Parameters. The estimation of parameters is a 
primary purpose of all scientific experimentation, and before formulating
the problem precisely, it may be worth while to consider briefly
its practical implications.

Suppose a plant breeder wishes to determine the general yielding 
ability of a new hybrid line of corn in some agricultural region. To do 
this, he selects a number of farms in the region and obtains the yields, 
say in pounds, of small plots planted on each of several farms. He 
thus obtains a set of observations, say 45, 27, 36, 34, 59, 40, ....
The average of these numbers gives a measure of the yielding ability.
This average is an estimate of the mean $\mu$ of some population with a
density f(x). Of course the population needs to be carefully specified. 
Were the farms selected at random? Did the farmers cultivate the 
plot along with the rest of the crop, or did the plots have special treat- 
ment? What were the weather conditions in that season? And so on. 
But leaving aside these questions and assuming randomness, we 
regard the experiment as a drawing of a sample from a population with 
density f(x) for the purpose of estimating the mean of the distribution. 

Since the observations were obtained only to the nearest pound, the 
distribution is, in fact, discrete. However, for measurements (as 
opposed to countings) it is customary to think of a continuous distribu- 
tion. The observations could have been obtained more accurately, 
but any effort in that direction would have been wasted because the 
sampling error of the estimate would well exceed errors of rounding 
to the nearest pound. In this connection, however, it is not always 
possible to reduce errors of measurement well below the magnitude of 
sampling errors. Thus a metallurgist studying thermal expansion of
some alloy might require a very accurate measurement of the length
of a rod and make several observations in inches, say 8.562, 8.564,
8.563, 8.563, ..., with precision equipment which can measure to
within about .001 inch. His distribution is discrete (defined at intervals
of .001) and cannot be refined; this discreteness may be the major
source of the error of his estimate.


In general, the estimation problem may be stated as follows: One
is investigating a population with a density function $f(x; \theta_1, \theta_2, \ldots, \theta_k)$,
where $x$ is the variate and $\theta_1, \theta_2, \ldots, \theta_k$ are parameters in the
distribution. Thus in the case of the gamma distribution there are
two parameters which we have called $\alpha$ and $\beta$, and in the present notation
we might exhibit the parameters by writing the gamma density
as $f(x; \alpha, \beta)$. On the basis of a random sample of observations, say
$x_1, x_2, \ldots, x_n$, one wishes to estimate one or more of the parameters
$\theta_1, \theta_2, \ldots, \theta_k$. The problem here is to find functions of the observations
which we may represent by $\hat\theta_1(x_1, x_2, \ldots, x_n)$, $\hat\theta_2(x_1, x_2, \ldots, x_n)$,
$\ldots$, such that the distribution of these functions will be concentrated
as closely as possible near the true values of the parameters. We shall
call such functions estimators. We have already seen, for example,
that if the parameter to be estimated is the population mean $\mu$, then
the function

$$\hat\mu(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \qquad (1)$$

is an estimator for $\mu$ and that the distribution of $\bar{x}$ actually does become
closely concentrated near the true mean $\mu$ for large samples when the
population variance exists.

In speaking of the estimation of parameters, the moments of a distribution
are usually intended to be included by the term "parameters"
even though they may not enter explicitly in the distribution
function. The moments will ordinarily be functions of the parameters
which do enter into the functional expression of the distribution, and
once those parameters are estimated, corresponding functions of those
estimates will estimate the moments. Of course, the moments can
also be estimated by means of the sample moments as indicated in the
preceding chapter.

Any estimate of a parameter is naturally subject to the errors of 
sampling, and it is important to make some statement about the pos- 
sible size of the error when giving an estimate. We shall defer the 
study of errors, however, until a later chapter and consider here only 
point estimates, i.e., single-valued estimates, as opposed to more general
estimates which merely specify the parameter to be within a
given interval.

8.2. Properties of Good Estimators. To consider the case of a single
parameter for simplicity, suppose we have a random sample of
size $n$ drawn from a population with a distribution $f(x; \theta)$. There are
infinitely many ways of choosing an estimating function $\hat\theta(x_1, x_2, \ldots, x_n)$,
and our problem is to choose a good one. Intuitively it is clear
what is meant by "good": the distribution of the estimator should be
concentrated near the true parameter value $\theta$. Thus if $\hat\theta_1$, $\hat\theta_2$, $\hat\theta_3$ are
different estimators of $\theta$ with densities $g_1(\hat\theta_1)$, $g_2(\hat\theta_2)$, $g_3(\hat\theta_3)$ as illustrated
in Fig. 39, then 6; is clearly a better estimator than either 6; or 63, and 
6; is better than 6, even though it is biased to the right. 

One method of comparing two estimators is by their relative efficiency.
If an estimator $\hat\theta_1(x_1, x_2, \ldots, x_n)$ has $E(\hat\theta_1 - \theta)^2 = A_1$, and if a
second estimator $\hat\theta_2(x_1, x_2, \ldots, x_n)$ has $E(\hat\theta_2 - \theta)^2 = A_2$, then the
efficiency of $\hat\theta_2$ relative to $\hat\theta_1$ is defined to be $A_1/A_2$; the ratio is usually
expressed as a percentage. If the efficiency of $\hat\theta_2$ relative to $\hat\theta_1$ is
greater than 100 per cent, then $\hat\theta_2$ may reasonably be regarded as a
better estimator of $\theta$ than $\hat\theta_1$. It is to be noted that $A_1$ and $A_2$ will
not be the variances of $\hat\theta_1$ and $\hat\theta_2$ unless $E(\hat\theta_1) = \theta$ and $E(\hat\theta_2) = \theta$.

Fig. 39.

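
A relative efficiency can be estimated by simulation. The sketch below compares the sample mean (taken as the first estimator) with the sample median (the second) for samples of 25 from a normal population with unit variance, and forms the efficiency of the second relative to the first as the ratio of the mean squared error of the first to that of the second. The choice of estimators, the sample size, the true value 10 and the number of repetitions are assumptions made for illustration.

```python
import random
import statistics

def mean_squared_error(estimator, n, theta, reps, rng):
    """Monte Carlo estimate of E(theta_hat - theta)^2 for normal samples of size n."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / reps

rng = random.Random(3)
theta, n, reps = 10.0, 25, 4000
A1 = mean_squared_error(statistics.fmean, n, theta, reps, rng)   # first estimator: sample mean
A2 = mean_squared_error(statistics.median, n, theta, reps, rng)  # second estimator: sample median
efficiency = A1 / A2  # efficiency of the median relative to the mean
print(f"A1={A1:.4f}  A2={A2:.4f}  efficiency={100 * efficiency:.0f} per cent")
```

An efficiency below 100 per cent indicates that, for this population, the median is the less concentrated of the two estimators.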
Several terms have come to be commonly used to describe estimators,
and we shall define them now.

Unbiased. If an estimator $\hat\theta(x_1, x_2, \ldots, x_n)$ for a parameter $\theta$ is
such that
$$E(\hat\theta) = \theta \qquad (1)$$
then $\hat\theta$ is said to be unbiased. If $E(\hat\theta) > \theta$, the estimator is said to be
positively biased; while if $E(\hat\theta) < \theta$, the estimator is said to be negatively
biased. In constructing estimators, it is obviously of some
advantage to construct an unbiased estimator, but this is not a very
crucial requirement. If the mean of an estimator differs but little
from the parameter value relative to the standard deviation of the
estimator, the estimator may be quite satisfactory.

Consistent. If an estimator $\hat\theta(x_1, x_2, \ldots, x_n)$ for a parameter $\theta$
is such that
$$P(\hat\theta \to \theta) \to 1 \quad \text{as } n \to \infty \qquad (2)$$
then $\hat\theta$ is said to be a consistent estimate of $\theta$. The symbolic criterion
is a way of stating that the estimate becomes near the true parameter
value with probability approaching one as the sample size increases
without limit. The sample mean $\bar{x}$ is an example of a consistent
estimator when the population variance is finite, for $\bar{x}$ has a variance
$\sigma^2/n$, and as $n \to \infty$, the variance of $\bar{x}$ approaches zero. Since
$$E(\bar{x}) = \mu$$
for any $n$, it follows that the distribution of $\bar{x}$ must become concentrated
at $\mu$ when $\sigma^2/n \to 0$.

A consistent estimator is obviously unbiased in the limit, but for 
finite sample sizes it may be biased though in such a way that the bias 
approaches zero as n becomes large. An unbiased estimator may or 
may not be consistent depending on whether or not its distribution 
becomes concentrated near its mean as the sample size increases. In 
estimating the mean, for example, we might define an estimator 
$\hat\theta = x_1$, where $x_1$ is the first observation of the sample; this estimate is
unbiased but not consistent. 
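
The contrast between the sample mean and the first observation can be seen in a small simulation. This sketch assumes a normal population with mean 5 and unit variance, and it measures concentration by the fraction of samples whose estimate falls within 0.1 of the mean; the population, the tolerance and the sample sizes are illustrative assumptions.

```python
import random
import statistics

def fraction_near(estimator, n, mu, eps, reps, rng):
    """Fraction of samples of size n whose estimate lies within eps of mu."""
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        if abs(estimator(sample) - mu) < eps:
            hits += 1
    return hits / reps

def first_observation(sample):
    """The estimator that ignores all but the first observation."""
    return sample[0]

rng = random.Random(11)
mu, eps, reps = 5.0, 0.1, 500
sizes = [10, 100, 1000]
near_mean = [fraction_near(statistics.fmean, n, mu, eps, reps, rng) for n in sizes]
near_first = [fraction_near(first_observation, n, mu, eps, reps, rng) for n in sizes]
print(near_mean)   # should rise toward 1 as n grows
print(near_first)  # should stay at about the same small value for every n
```

Both estimators are unbiased, but only the distribution of the sample mean closes in on the true value as the sample size increases.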

Efficient. In a great many estimation problems it is possible to
construct estimators $\hat\theta(x_1, x_2, \ldots, x_n)$ such that $\sqrt{n}(\hat\theta - \theta)$ has a
normal distribution with zero mean in the limit as the sample size $n$
increases. Confining our attention to this class of estimators (and
assuming such a class exists), there may be one or more estimators
which will have a limiting variance which is smaller than the limiting
variances of the other estimators. These estimators which have the
smallest limiting variance are called efficient estimators of $\theta$.

It can be shown, for example, that for samples drawn from a normal
population with mean $\mu$ and variance $\sigma^2$, $\hat\theta_1 = \bar{x}$ is an efficient estimator
of $\mu$. In fact, the limiting distribution of $\sqrt{n}(\bar{x} - \mu)$ is normal with
zero mean and variance $\sigma^2$. No other estimator can have a smaller
limiting variance. However, there are many other efficient estimators,
i.e., estimators with the same limiting normal distribution. For
example,


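$$\hat\theta_2 = \frac{1}{n + 1} \sum_{i=1}^{n} x_i$$

is efficient since it can be shown that $\sqrt{n}(\hat\theta_2 - \mu)$ has a normal distribution
with zero mean and variance $\sigma^2$ in the limit as $n$ becomes
large. It is to be observed that $\hat\theta_2$ is biased, since

$$E(\hat\theta_2) = \frac{n}{n + 1}\,\mu$$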


and in general efficient estimators need not be unbiased for finite 
samples though they are clearly unbiased in the limit. Efficient 
estimators are necessarily consistent. 

Sufficient. An estimator is said to be sufficient if it contains all the
information in the sample regarding the parameter. More precisely,
if $x_1, x_2, \ldots, x_n$ is a sample from a population with density $f(x; \theta)$
and if $\hat\theta(x_1, x_2, \ldots, x_n)$ is an estimator such that the conditional distribution
of $x_1, x_2, \ldots, x_n$ given $\hat\theta$ does not depend on $\theta$, then $\hat\theta$ is a
sufficient estimator. This means that the joint density of the sample
may be put in the form

$$\prod_{i=1}^{n} f(x_i; \theta) = g(x_1, x_2, \ldots, x_n \mid \hat\theta)\, h(\hat\theta; \theta) \qquad (3)$$

where the function $g$ does not involve $\theta$. In this form it is clear that no
other function of the $x_i$ can provide any information about $\theta$. For
consider any other function of the $x_i$, say $u(x_1, x_2, \ldots, x_n)$. The
distribution of $u$ for a fixed $\hat\theta$ will be determined by the conditional
density $g(x_1, x_2, \ldots, x_n \mid \hat\theta)$ and will have $\hat\theta$ but not $\theta$ as a parameter.
Hence $u$ can only provide information about $\hat\theta$. But $\hat\theta$ is known in
any given problem, so that any information provided by $u$ is of no use.

Sufficient estimators are obviously the most desirable kind of esti- 
mators to have, but unfortunately they do not exist except in rather 
special cases. Ordinarily we shall have to be content with less 
satisfactory estimators. 

We have defined all these concepts in terms of one parameter, but
the extension to several parameters is straightforward. Thus if $x$ is
distributed by $f(x; \theta_1, \theta_2, \ldots, \theta_k)$, a set of estimators $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k$
is unbiased if, for every $i$,

$$E(\hat\theta_i) = \theta_i$$

The set is consistent if, for every $i$,

$$P(\hat\theta_i \to \theta_i) \to 1 \quad \text{as } n \to \infty$$

where $n$ is the sample size. The set is sufficient if

$$\prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_k) = g(x_1, x_2, \ldots, x_n \mid \hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)\, h(\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k; \theta_1, \theta_2, \ldots, \theta_k)$$

The generalization of the meaning of efficient requires some knowledge
of the multivariate normal distribution, a distribution which we shall
study in the next chapter. If $k$ variables $u_1, u_2, \ldots, u_k$ have a
multivariate normal distribution, it can be shown that there are linear
functions $V_1, V_2, \ldots, V_k$ of the $u_i$ which are independent in the probability
sense and each of which has the simple normal distribution, so
that the multivariate normal distribution of the $u_i$ may be written
as the product of $k$ single-variate normal distributions of the $V_i$.
(This is illustrated in Prob. 25 of Chap. 9.) A set of estimators is
efficient if $\sqrt{n}(\hat\theta_i - \theta_i)$ have the multivariate normal distribution in
the limit as the sample size increases, and if the linear functions $V_i$ of
the $\sqrt{n}(\hat\theta_i - \theta_i)$ which are independent in the probability sense are
such that the product of their variances is a minimum.

8.3. Principle of Maximum Likelihood. To introduce the idea, we
shall consider a very simple estimation problem. Suppose an urn
contains a number of black and white balls, and suppose it is known
that the ratio of the numbers is three to one but that it is not known
whether the black or the white balls are the more numerous. That is,
the probability of drawing a black ball is either 1/4 or 3/4. If $n$ balls
are drawn with replacement from the urn, the distribution of the
number of black balls is given by the binomial

$$f(x; p) = \binom{n}{x} p^x q^{n-x} \qquad (1)$$

where $q = 1 - p$ and $p$ is the probability of drawing a black ball.

We shall draw a sample of three balls with replacement and attempt 
to estimate the unknown parameter p of the distribution. The esti- 
mation problem is particularly simple in this case because we have 
only to choose between the two numbers .25 and .75. Let us anticipate 
the result of the drawing of the sample. The possible outcomes and 
their probabilities under the two possibilities are given below:

    Outcome: x        0       1       2       3
    f(x; 3/4)       1/64    9/64   27/64   27/64
    f(x; 1/4)      27/64   27/64    9/64    1/64


The principle of maximum likelihood essentially assumes that the
sample is representative of the population. We shall state it more
precisely later. In the present example, if we found $x = 0$ in a sample
of three, the estimate .25 for $p$ would be preferred over .75 because
the probability 27/64 is greater than 1/64, i.e., because a sample with
$x = 0$ is more likely to arise from a population with $p = 1/4$ than from
one with $p = 3/4$. And in general we should estimate $p$ by .25 when
$x = 0$ or 1, and by .75 when $x = 2$ or 3. The estimator may be defined
as

$$\hat{p}(x) = \begin{cases} .25, & x = 0, 1 \\ .75, & x = 2, 3 \end{cases} \qquad (2)$$

The estimator thus selects for every $x$ the value of $\hat{p}$ such that

$$f(x; \hat{p}) > f(x; p')$$

where $p'$ is the alternative value of $p$.
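
The urn example is small enough to work out exactly. The sketch below evaluates the binomial of equation (1) for a sample of three under each of the two candidate values of p and then picks, for each possible outcome, the candidate under which that outcome is more probable. Exact fractions are used so that the values can be compared with 27/64 and 1/64 in the text; the function names are illustrative only.

```python
from fractions import Fraction
from math import comb

def binomial(x, n, p):
    """f(x; p) = C(n, x) p^x q^(n-x), the binomial of equation (1)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n = 3
candidates = [Fraction(1, 4), Fraction(3, 4)]

# Probability of each outcome x = 0, 1, 2, 3 under each candidate value of p.
table = {p: [binomial(x, n, p) for x in range(n + 1)] for p in candidates}

def p_hat(x):
    """Choose the candidate p under which the observed x is most probable."""
    return max(candidates, key=lambda p: binomial(x, n, p))

for p in candidates:
    print(f"p = {p}:", [str(v) for v in table[p]])
print([str(p_hat(x)) for x in range(n + 1)])
```

The last line gives the estimate for x = 0, 1, 2, 3 in turn, which can be checked against the estimator defined in (2).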

More generally, if several alternative values of $p$ were possible, say
$p = 0.1, 0.2, 0.3, \ldots, 1.0$, we might reasonably proceed in the
same manner. Thus if we found $x = 6$ in a sample of 25 from a
binomial population, we should substitute all possible values of $p$ in
the expression

$$f(6; p) = \binom{25}{6} p^6 (1 - p)^{19} \qquad (3)$$

and choose as our estimate that value of $p$ which maximized $f(6; p)$.
For the given possible values of $p$ we should find our estimate to be
$\hat{p}(6) = .2$. If there were no restriction on $p$ except that $0 < p < 1$,
then $f(6; p)$ would be regarded as a continuous function of $p$ over the
given interval and the position of its maximum value would be found
by putting its derivative with respect to $p$ equal to zero and solving
the resulting equation for $p$. Thus,

$$\frac{d}{dp} f(6; p) = \binom{25}{6} p^5 (1 - p)^{18} [6(1 - p) - 19p] \qquad (4)$$

and on putting this equal to zero and solving for $p$, we find $p = 0$, 1,
6/25 are the roots. The first two roots are impossible as far as the
given sample is concerned, and our estimate is therefore $\hat{p} = 6/25$.
This estimate has the property that

$$f(6; \hat{p}) > f(6; p') \qquad (5)$$

where $p'$ is any other value of $p$ in the interval $0 < p < 1$.
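
Both versions of this calculation can be checked directly. The sketch below first searches the ten permitted values of p for the one that maximizes f(6; p), and then confirms that p = 6/25 makes the bracketed factor of the derivative vanish and gives a larger likelihood than any of the ten grid values. Exact fractions are used; the function and variable names are illustrative.

```python
from fractions import Fraction
from math import comb

def likelihood(p, x=6, n=25):
    """f(6; p) = C(25, 6) p^6 (1 - p)^19, regarded as a function of p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Restricted case: p may only be one of 0.1, 0.2, ..., 1.0.
grid = [Fraction(k, 10) for k in range(1, 11)]
p_hat_grid = max(grid, key=likelihood)

# Unrestricted case: the interior root of 6(1 - p) - 19p = 0.
p_hat = Fraction(6, 25)
bracket = 6 * (1 - p_hat) - 19 * p_hat  # factor in the derivative; zero at the root

print(p_hat_grid, p_hat, bracket, float(likelihood(p_hat)))
```

The grid search and the calculus give nearby but different answers, since 6/25 is not among the permitted values in the restricted problem.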
The principle of maximum-likelihood estimation is simply this:
If $f(x_1, x_2, \ldots, x_n; \theta)$ is the density for a random sample of size $n$
drawn from a population with an unknown parameter $\theta$, then the maximum-likelihood
estimate of $\theta$ is the number $\hat\theta$, if it exists, such that

$$f(x_1, x_2, \ldots, x_n; \hat\theta) > f(x_1, x_2, \ldots, x_n; \theta')$$

where $\theta'$ is any other possible value of $\theta$.

While we have been discussing a discrete distribution in particular,
the principle is the same for a continuous distribution. Suppose $x$ is
continuous and has the density $f(x; \theta)$. The probability that $x$ will lie
in a small interval $\Delta x$ is approximately $f(x; \theta)\Delta x$. Given a sample of
one observation, $x_1$, we may choose arbitrarily a small interval $\Delta x$
about $x_1$ and maximize the probability $f(x_1; \theta)\Delta x$ as a function of $\theta$.
However, since $\Delta x$ is arbitrary, it is not a function of $\theta$ and behaves
as a constant in so far as variations in $\theta$ are concerned. Hence in
the maximization we may disregard $\Delta x$ and deal only with $f(x_1; \theta)$.
The conclusion is obviously the same for samples of more than one
observation.


The function $\prod_{i=1}^{n} f(x_i; \theta)$, which gives the sample distribution when
regarded as a function of the $x_i$, is regarded as a function of $\theta$ for fixed
values of the $x_i$ in determining the maximum-likelihood estimate of $\theta$.
When regarded as a function of $\theta$, the expression is often referred to
as the likelihood function of $\theta$. The maximum-likelihood estimate of
$\theta$ is therefore the point at which the likelihood function has a maximum.

When more than one parameter is involved, the maximum-likelihood
estimates of the parameters are defined similarly. Thus if a sample of
size $n$ has the density

$$\prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_k)$$

then the maximum-likelihood estimates of the parameters are the
numbers $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k$, if such a set exists, which maximize the given
expression as a function of the $\theta_i$. It often happens in practice that
one wishes to estimate some but not all of the unknown parameters
of a distribution. Usually it turns out that the maximizing values
for the desired set of parameters depend on the remaining parameters,
so that it is necessary actually to estimate all the unknown parameters.

8.4. Some Maximum-likelihood Estimators. We shall obtain in
this section maximum-likelihood estimators for parameters of some
of the common distribution functions. Ordinarily the parameters
may be regarded as continuous variables, and the maximizing value
may be obtained by putting the derivative of the likelihood function
equal to zero and solving for the parameter in the resulting equation.
Since likelihood functions are products, and since sums are usually
more convenient to deal with than products, it is customary to maximize
the logarithm of the likelihood rather than the likelihood itself,
i.e., to maximize

$$L = \log \prod_{i=1}^{n} f(x_i; \theta) = \sum_{i=1}^{n} \log f(x_i; \theta) \qquad (1)$$

Of course the logarithm of the likelihood has its maximum at the same
point as does the likelihood.

Binomial. Suppose samples of size $n$ are drawn from the binomial
distribution

$$f(x; p) = p^x q^{1-x} \qquad x = 0, 1 \qquad (2)$$

The sample values $x_1, x_2, \ldots, x_n$ will be a sequence of zeros and
ones, and the likelihood is

$$\prod_{i=1}^{n} p^{x_i} q^{1-x_i} = p^{\sum x_i} q^{n - \sum x_i} \qquad (3)$$

and letting $y = \sum x_i$ we have

$$L = y \log p + (n - y) \log q \qquad (4)$$

$$\frac{dL}{dp} = \frac{y}{p} - \frac{n - y}{q} \qquad (5)$$

remembering that $q = 1 - p$. On putting this last expression equal
to zero and solving for $p$, we find the estimator

$$\hat{p} = \frac{y}{n} = \frac{1}{n} \sum x_i = \bar{x} \qquad (6)$$

which is, of course, the obvious estimator for this parameter.
We can show that this estimator is sufficient and therefore that it
would be fruitless to search for a better estimator for the parameter.
We need to show that the conditional distribution of the $x_i$ given $\hat{p}$ is
independent of $p$. Since the marginal distribution of $n\bar{x} = y$ is given
by

$$\binom{n}{y} p^y q^{n-y} \qquad (7)$$

the conditional distribution of the $x_i$ given $y$ is obtained by dividing
(3) by (7) to get, say,

$$g(x_1, x_2, \ldots, x_n \mid \hat{p}) = \frac{1}{\binom{n}{y}} \qquad x_i = 0, 1; \quad \sum x_i = n\hat{p} \qquad (8)$$

a distribution which is independent of the parameter $p$.
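
The sufficiency argument can be verified by brute force for a small sample. The sketch below enumerates every sequence of four zeros and ones, computes the conditional probability of a given sequence given its sum (with the marginal obtained by summing the joint probabilities rather than by quoting the binomial formula), and evaluates it for two values of p. The sample size and the two values of p are arbitrary choices for illustration.

```python
from itertools import product
from math import comb

def joint(sample, p):
    """Probability of a particular sequence of zeros and ones: p^y q^(n-y)."""
    y = sum(sample)
    return p**y * (1 - p) ** (len(sample) - y)

def conditional_given_sum(sample, p):
    """P(x_1, ..., x_n | sum of the x_i), with the marginal found by enumeration."""
    n, y = len(sample), sum(sample)
    same_sum = [s for s in product((0, 1), repeat=n) if sum(s) == y]
    return joint(sample, p) / sum(joint(s, p) for s in same_sum)

example = (1, 0, 1, 0)
cond_a = conditional_given_sum(example, 0.3)
cond_b = conditional_given_sum(example, 0.8)
print(cond_a, cond_b, 1 / comb(4, 2))
```

The two conditional probabilities should agree with each other and with the reciprocal of the binomial coefficient, whatever value of p is used.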


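Normal. Samples of size $n$ from the normal distribution have the
density

$$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x_i - \mu)^2/2\sigma^2} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-(1/2\sigma^2) \sum (x_i - \mu)^2} \qquad (9)$$

The logarithm of the likelihood is

$$L = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum (x_i - \mu)^2 \qquad (10)$$

To find the location of its maximum, we compute

$$\frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2} \sum (x_i - \mu) \qquad (11)$$

$$\frac{\partial L}{\partial \sigma^2} = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2\sigma^4} \sum (x_i - \mu)^2 \qquad (12)$$

and on putting these derivatives equal to zero and solving the resulting
equations for $\mu$ and $\sigma^2$, we find the estimators

$$\hat\mu = \frac{1}{n} \sum x_i = \bar{x} \qquad (13)$$

$$\hat\sigma^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 \qquad (14)$$

which turn out to be the sample moments corresponding to $\mu$ and $\sigma^2$.
The estimator $\hat\mu$ is unbiased, but $\hat\sigma^2$ is not, since

$$E(\hat\sigma^2) = \frac{n - 1}{n} \sigma^2 \qquad (15)$$

We shall see later that this pair of estimators is a sufficient pair for
estimating the parameters; the sample distribution for given values of
$\hat\mu$ and $\hat\sigma^2$ does not involve $\mu$ and $\sigma^2$. We note in this case that it is
possible to estimate $\mu$ without estimating $\sigma^2$, but not possible to estimate
$\sigma^2$ without first estimating $\mu$.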
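
A short numerical check of these estimators follows. It is a sketch only: it borrows the six plot yields quoted in Sec. 8.1 and treats them, purely for illustration, as a sample from a normal population, which the text does not assert. The function names are hypothetical.

```python
import math

def log_likelihood(sample, mu, sigma2):
    """Logarithm of the normal likelihood, as in equation (10)."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(sigma2) - ss / (2 * sigma2)

def normal_mle(sample):
    """Estimators (13) and (14): the sample mean and the divisor-n variance."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat

yields = [45, 27, 36, 34, 59, 40]  # plot yields quoted in Sec. 8.1; normality is assumed here
mu_hat, sigma2_hat = normal_mle(yields)
best = log_likelihood(yields, mu_hat, sigma2_hat)

# No nearby pair of parameter values should give a larger log-likelihood.
neighbours = [(mu_hat + dm, sigma2_hat * f) for dm in (-1, 0, 1) for f in (0.8, 1.0, 1.25)]
is_maximum = all(log_likelihood(yields, m, s) <= best for m, s in neighbours)
print(mu_hat, sigma2_hat, is_maximum)
```

Checking a handful of neighbouring points does not prove that the maximum is global, but it is a quick guard against having located a minimum or made an algebraic slip.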
Uniform. The density for samples of size $n$ from the uniform distribution
over the range $\alpha < x < \beta$ is

$$\left(\frac{1}{\beta - \alpha}\right)^n \qquad (16)$$

so that

$$L = -n \log (\beta - \alpha) \qquad (17)$$

If we put the derivatives of this expression with respect to $\alpha$ and $\beta$
equal to zero and attempt to solve for $\alpha$ and $\beta$, we find that at least one
of $\alpha$, $\beta$ must be infinite, a nonsensical result. The trouble here is that
the likelihood does not have zero slope at its maximum value, so that
we must locate its maximum by other means. It is evident from (16)
that the likelihood will be made as large as possible when $\beta - \alpha$ is
made as small as possible. Given a sample of $n$ observations $x_1, x_2, \ldots, x_n$,
suppose we denote the smallest of the observations by $x'$
and the largest by $x''$. Clearly $\alpha$ can be no larger than $x'$ and $\beta$ can
be no smaller than $x''$; hence the smallest possible value for $\beta - \alpha$ is
$x'' - x'$. The maximum-likelihood estimators are obviously

$$\hat\alpha = x' \qquad \hat\beta = x'' \qquad (18)$$

a somewhat curious result because no use is made of the intervening
observations.
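
The uniform case can be illustrated with a few made-up observations. In this sketch the likelihood is taken to be zero whenever an observation falls outside the proposed range, and the end points are treated as attainable so that the maximum is actually reached; both conventions, like the sample values themselves, are assumptions of the example.

```python
def uniform_likelihood(sample, alpha, beta):
    """Expression (16), taken as zero if any observation lies outside [alpha, beta]."""
    if alpha > min(sample) or beta < max(sample):
        return 0.0
    return (1.0 / (beta - alpha)) ** len(sample)

def uniform_mle(sample):
    """Estimators (18): the smallest and the largest observation."""
    return min(sample), max(sample)

sample = [0.62, 0.15, 0.88, 0.41, 0.57]  # hypothetical observations
alpha_hat, beta_hat = uniform_mle(sample)

at_mle = uniform_likelihood(sample, alpha_hat, beta_hat)
wider = uniform_likelihood(sample, alpha_hat - 0.05, beta_hat + 0.05)
narrower = uniform_likelihood(sample, alpha_hat + 0.05, beta_hat)
print(alpha_hat, beta_hat, at_mle, wider, narrower)
```

Widening the range lowers the likelihood, and narrowing it past an observation drops the likelihood to zero, so the maximum sits on the boundary where no derivative vanishes.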


Fig. 40.


These three examples are sufficient to illustrate the application of
the method of maximum likelihood. The last example shows that one
must not rely on the differentiation process to locate the maximum.
The function $L(\theta)$ may, for example, be represented by the curve in
Fig. 40, where the actual maximum is at $\hat\theta$, but the differentiation
process would locate $\theta'$ as the maximum. One must also remember
that the equation $\partial L/\partial\theta = 0$ locates minima as well as maxima, and
hence one must avoid using a root of the equation which actually
locates a minimum.

We have not illustrated the estimation of a parameter which appears
as a factorial in the distribution function. This may be done in any
given problem with the aid of tables of the derivative of the factorial
function. However, such a problem arises so rarely that it is not worth
while to study it here. The parameters ($n$ in the binomial distribution,
$\alpha$ in the gamma distribution, and $\alpha$ and $\beta$ in the beta distribution)
are usually determined by the sample size and need not be estimated
since the sample size is ordinarily known.

8.5. Properties of Maximum-likelihood Estimators. There is no 
general argument which will show that maximum-likelihood estimators 
are the best possible estimators. There is, in fact, no way of dealing 
with the estimation problem (or any other problem requiring induc- 
tive inference) completely within the framework of the theory of prob- 
ability. The theory of probability as a branch of mathematics is a 
deductive science—given certain axioms, certain conclusions neces- 
sarily follow. Uncertain conclusions are outside the realm of the 
theory. It is precisely here that statistics departs from that theory 
and becomes an independent discipline. New axioms are required to 
deal with the problems of statistics; one such axiom might be the 
principle of maximum likelihood. Whether the new axiom is good or 
not from the practical viewpoint is, of course, of no interest from the 
strictly logical viewpoint. When a new axiom is added to a given set 
of axioms, a new theory involving additional theorems arises, and from 
the logical viewpoint the only requirement of the new axiom is that 
it be consistent with the other axioms. 

We cannot, therefore, hope to prove that a new axiom or principle is 
right or wrong. From the practical viewpoint, we naturally want an 
axiom that will give rise to a useful theory of estimation. In framing 
such a principle, one would first consider what he wanted the theory 
to do in practice in terms of certain intuitively desirable criteria 
(unbiasedness, consistency, for example) and then try to formulate a 
principle which would lead to such a theory. The principle of maximum
likelihood, which is due to R. A. Fisher, forms one basis for a
theory of estimation. Other principles would lead to different theories.
A choice between principles is, in the last analysis, a matter of opinion 
as to what is a good theory. After examining the properties of maxi- 
mum-likelihood estimates, it will become apparent that Fisher’s prin- 
ciple leads to a very useful theory, and that for general purposes of 
estimation there is little if any room for improvement in the theory. 

Bias. Maximum-likelihood estimators are not, in general, unbiased, 
as we have already seen in the case of the variance of a normal
population, where

\[ E(\hat\sigma^2) = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \tag{1} \]

In this case the estimator could be made unbiased by multiplying it
by $n/(n - 1)$ to obtain the estimator

\[ \hat\sigma_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \tag{2} \]

which is an unbiased estimator of $\sigma^2$. And in general, when
maximum-likelihood estimators are biased, it is possible to modify them slightly
so that they will be unbiased. 
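
The bias described above can be checked roughly by simulation. The following is a minimal sketch, not part of the original argument: the function names, the seed, and the choice n = 5, sigma = 2 are illustrative assumptions. It averages the maximum-likelihood estimator (divisor n) and the corrected estimator (divisor n - 1) over many simulated normal samples.

```python
import random


def variance_estimates(sample):
    """Return (ML estimate with divisor n, corrected estimate with divisor n - 1)."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / n, ss / (n - 1)


def average_estimates(n, sigma, trials, seed=0):
    """Average both estimators over `trials` simulated normal samples of size n."""
    rng = random.Random(seed)
    total_ml = 0.0
    total_corrected = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        ml, corrected = variance_estimates(sample)
        total_ml += ml
        total_corrected += corrected
    return total_ml / trials, total_corrected / trials


if __name__ == "__main__":
    # With n = 5 and sigma^2 = 4 the averages should lie near
    # (n - 1) / n * 4 = 3.2 and 4 respectively, up to sampling error.
    print(average_estimates(n=5, sigma=2.0, trials=20000))
```

The averages only approximate the expectations; the exact statement is equation (1).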

If one requires his estimators to be unbiased, he is using an additional
principle which is somewhat in conflict with the principle of maximum 
likelihood. While there is no particular harm in this (aside from a 
minor logical inconsistency), there is really nothing to be gained by it. 
The only claim for unbiasedness as a good criterion is that it forces the 
distribution of the estimator to be centered (in the center-of-gravity 
sense) at the true parameter value. But one could just as well require 
the median, for example, of the distribution to be the true parameter 
value. Or some other central value might be used. The point is 
that all one can ask is that the true parameter value be somewhere 
near the center of the distribution of the estimator. He may choose to 
define the center however he pleases [mean, median, point such that
$E(\hat\theta - \theta)^2$ is minimized], but as between reasonable definitions of
“center” there is not much choice.

Maximum-likelihood estimators do, in fact, have the true parameter 
values near the centers of their distributions; we shall not be concerned 
if the parameter does not happen to be at the exact center of gravity 
of the distribution. 

Invariance. A particularly convenient property of maximum-likelihood
estimators is the fact that if $\hat\theta$ is the maximum-likelihood
estimator for $\theta$, and if $u(\theta)$ is any single-valued function of $\theta$, then $u(\hat\theta)$
is the maximum-likelihood estimator for $u(\theta)$. This is easily seen to
be the case. Let

\[ L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta) \]

Instead of estimating $\theta$ we wish to estimate $u(\theta)$. The function $u(\theta)$
defines an inverse function $\theta = v(u)$. The estimator $\hat{u}$ for $u$ is the
value of $u$ which maximizes $L[v(u)]$. Since the largest value of $L$
occurs at $\theta = \hat\theta$, it follows that $v(\hat{u})$ must equal $\hat\theta$, and hence that
$\hat{u} = u(\hat\theta)$, since $u$ is the inverse function of $v$.
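
The invariance argument can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the author's method: the sample values, function names, and the use of a golden-section search are all choices made here. It maximizes the normal log likelihood directly as a function of the standard deviation (with the mean held at the sample mean) and compares the result with the square root of the maximum-likelihood estimate of the variance.

```python
import math


def ml_variance(sample):
    """Maximum-likelihood estimate of a normal variance (divisor n)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


def log_likelihood_sigma(sigma, sample):
    """Normal log likelihood as a function of sigma, mean fixed at the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - ss / (2 * sigma ** 2)


def maximize(func, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if func(c) > func(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


if __name__ == "__main__":
    sample = [4.1, 5.3, 3.8, 6.0, 5.2]  # hypothetical observations
    sigma_direct = maximize(lambda s: log_likelihood_sigma(s, sample), 0.01, 10.0)
    print(sigma_direct, math.sqrt(ml_variance(sample)))
```

The two printed numbers should agree to within the tolerance of the search, which is what the invariance property asserts for $u(\theta) = \sqrt{\theta}$.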

On the basis of this argument we can conclude directly, for example,
that the maximum-likelihood estimator of the standard deviation of a
normal distribution is

\[ \hat\sigma = \sqrt{\hat\sigma^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} \]


Similarly, since the fourth moment about the mean for a normal population
is $\mu_4 = 3\sigma^4$, it follows that the maximum-likelihood estimator for
$\mu_4$ is

\[ \hat\mu_4 = 3(\hat\sigma^2)^2 = 3\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^2 \]

not the fourth sample moment, $m_4$, about the mean as might have been
anticipated. Of course, $m_4$ could be used as an estimator for $\mu_4$, but
an examination of the sampling distributions of $\hat\mu_4$ and $m_4$ would show
that the former has a distribution which is more closely concentrated
about $\mu_4$.

In general, since the moments of a population are ordinarily func- 
tions of the parameters that appear in the distribution function, it 
follows that the maximum-likelihood estimators of the moments are 
the same functions of the estimators of the parameters. Thus the 
$r$th moment of a population with density $f(x; \theta)$ will be some function,
say $\mu_r'(\theta)$, of $\theta$. The maximum-likelihood estimator of this moment
will therefore be $\mu_r'(\hat\theta)$, where $\hat\theta$ is the maximum-likelihood estimator of
$\theta$.

Sufficiency. Not all parameters have sufficient estimators, but if a 
parameter does have sufficient estimators, it can be shown that the 
maximum-likelihood estimator will be a sufficient estimator. The 
proof of this statement is of a somewhat advanced mathematical 
character and will be omitted. 

Efficiency. When we examine the large-sample distribution of 
maximum-likelihood estimators in a later chapter, we shall see that 
under fairly general conditions the quantity $\sqrt{n}\,(\hat\theta - \theta)$ is asymptotically
normally distributed with a finite variance; furthermore no
other asymptotically normally distributed estimator can have a smaller 
variance. It follows then that maximum-likelihood estimators are 
efficient and incidentally are consistent estimators. 

All these properties show that the principle of maximum likelihood 
leads to a very satisfactory theory of estimation. However, perhaps 
the most important character of the theory from a practical standpoint 
is of a different kind. It is easy enough to set up in theory a system of 
estimation by specifying certain properties the estimators should have, 
but to find the actual functional forms of the estimators may be a very 
difficult matter. The theory of maximum likelihood does not have
any difficulty of this kind. The estimating functions are determined
directly by the maximization process. Thus the theory is eminently 
satisfactory on two counts: it gives estimators which have desirable 
properties, and the estimators are easy to find. 

8.6. Notes and References. Fisher’s paper in which the principle 
of maximum likelihood was first expounded is cited below. Before 
the publication of this paper, the customary method for estimating 
parameters was the method of moments. If a distribution function 
involved $r$ parameters $\theta_1, \theta_2, \dots, \theta_r$, this technique called for finding
the first $r$ population moments as functions of the parameters:

\[ \mu_i'(\theta_1, \theta_2, \dots, \theta_r) = \int_{-\infty}^{\infty} x^i f(x; \theta_1, \theta_2, \dots, \theta_r)\,dx, \qquad i = 1, 2, \dots, r \]

then equating the sample moments to these functions, and solving the
resulting equations for the parameters. In a few instances this method
gives the same estimators as does the maximum likelihood method, but 
generally the estimators are different. 
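
The normal distribution is one of the instances in which the two methods agree, and this can be seen in a short sketch. The code below is an illustration only; the function names and sample values are assumptions made here. For the normal density the first two population moments are $\mu_1' = \mu$ and $\mu_2' = \sigma^2 + \mu^2$, so equating them to the sample moments and solving gives the method-of-moments estimators.

```python
def sample_moment(sample, r):
    """r-th sample moment about the origin."""
    return sum(x ** r for x in sample) / len(sample)


def normal_method_of_moments(sample):
    """Solve mu'_1 = mu and mu'_2 = sigma^2 + mu^2 for (mu, sigma^2)."""
    m1 = sample_moment(sample, 1)
    m2 = sample_moment(sample, 2)
    return m1, m2 - m1 ** 2


def normal_maximum_likelihood(sample):
    """Maximum-likelihood estimators (mean, variance with divisor n)."""
    n = len(sample)
    mean = sum(sample) / n
    return mean, sum((x - mean) ** 2 for x in sample) / n


if __name__ == "__main__":
    sample = [2.0, 3.5, 1.0, 4.5, 3.0]  # hypothetical observations
    print(normal_method_of_moments(sample))
    print(normal_maximum_likelihood(sample))
```

The two pairs of estimates coincide up to rounding, since $\frac{1}{n}\sum x_i^2 - \bar{x}^2 = \frac{1}{n}\sum (x_i - \bar{x})^2$.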

Fisher was able to demonstrate that his maximum-likelihood esti- 
mators were usually far superior to those obtained by the older 
method. In the second paper cited below he further showed that 
maximum-likelihood estimators could not be essentially improved. 
Thus Fisher virtually solved the whole problem of point estimation in 
these two remarkable papers. 


1. R. A. Fisher: “On the mathematical foundations of theoretical
statistics,” Philosophical Transactions of the Royal Society, Series
A, Vol. 222 (1922). 

2. R. A. Fisher: “Theory of statistical estimation,” Proceedings of the 
Cambridge Philosophical Society, Vol. 22 (1925). 


8.7. Problems 


1. Is the sample mean necessarily an efficient estimator of the 
population mean for every population? 

2. If an estimator is unbiased, can it be expected, for repeated
samplings, to underestimate the true parameter half the time and
overestimate it half the time?


3. For samples of size 20 find the efficiency of $\bar{x}_1 = \frac{1}{20}\sum_{i=1}^{20} x_i$ relative to
$\bar{x}_2 = \frac{1}{10}\sum_{i=1}^{10} x_i$ as estimators of the population mean.


4. If $\hat\theta$ is a sufficient estimator of $\theta$ and if $u(\theta)$ is a function of $\theta$, is
$u(\hat\theta)$ a sufficient estimator of $u$?


5. Find the maximum-likelihood estimator for $\beta$ given a sample of
size $n$ from a population with $f(x) = 1/\beta$, $0 < x < \beta$.

6. The sample 1.3, 0.6, 1.7, 2.2, 0.3, 1.1 was drawn from a population
with the density $f(x) = 1/\beta$, $0 < x < \beta$. What are the
maximum-likelihood estimates of the mean and variance of the population?

7. What is the maximum-likelihood estimator for $\alpha$ in the density
$f(x) = (\alpha + 1)x^{\alpha}$, $0 < x < 1$?

8. Assuming $\alpha$ known, find the maximum-likelihood estimator for
$\beta$ in the gamma distribution.

9. Find the maximum-likelihood estimator for the parameter of 
the Poisson distribution. 

10. Find the maximum-likelihood estimator for the variance of a 
normal population, assuming the mean is known. 

11. Find the maximum-likelihood estimator for the variance of the
gamma distribution, assuming $\alpha$ is known.

12. If $x$ is distributed by $f(x) = 1/\beta$, $0 < x < \beta$, and one considers
samples consisting of only one observation $x$, then since $E(x) = \beta/2$,
a reasonable estimator for $\beta$ might be $\hat\beta_1 = 2x$. On the other hand,
the maximum-likelihood estimator for $\beta$ is $\hat\beta_2 = x$. Is there any
choice between these two estimators on grounds of relative efficiency?

13. If $x$ is normally distributed with mean $\mu$ and variance $\sigma^2$, find,
for samples of size $k$, the maximum-likelihood estimator of the point $A$
such that $\int_{A}^{\infty} n(x; \mu, \sigma^2)\,dx = .05$.

14. It is shown in Chap. 10 that the mean of a sample from a normal 
population is exactly normally distributed. Use this fact to show that 
the sample mean is a sufficient estimator of the population mean. 

15. In genetic investigations one frequently samples from a binomial
population $f(x) = \binom{m}{x} p^x q^{m-x}$ except that observations of $x = 0$ are impossible,
so that in fact the sampling is from the conditional distribution

\[ f(x) = \binom{m}{x} \frac{p^x q^{m-x}}{1 - q^m}, \qquad x = 1, 2, \dots, m \]

Find the maximum-likelihood estimator of $p$ in the case $m = 2$ for
samples of size $n$.
16. Find the estimator for $\alpha$ in the density

\[ f(x; \alpha) = \frac{2}{\alpha^2}(\alpha - x), \qquad 0 < x < \alpha \]

for samples of size 2.


17. Referring to Prob. 16, what is the maximum-likelihood estimator 
of the population mean? 

18. An urn contains black and white balls. A sample of size n 
is drawn with replacement. What is the maximum-likelihood esti- 
mator of the ratio R of black to white balls in the urn? 

19. Referring to Prob. 18, suppose one draws balls one by one with 
replacement until a black ball appears. Let x be the number of 
draws required (not counting the last draw). This operation is 
repeated $n$ times to obtain a sample $x_1, x_2, \dots, x_n$. What is the
maximum-likelihood estimator of R on the basis of this sample? 

20. Suppose n cylindrical shafts made by a machine are selected at 
random from the production of the machine and their diameters and 
lengths measured. It is found that $n_{11}$ have both measurements
within the tolerance limits, $n_{12}$ have satisfactory lengths but unsatisfactory
diameters, $n_{21}$ have satisfactory diameters but unsatisfactory
lengths, and $n_{22}$ are unsatisfactory as to both measurements;
$\sum n_{ij} = n$. Each shaft may be regarded as a drawing from a multinomial
population with density

\[ p_{11}^{x_{11}}\, p_{12}^{x_{12}}\, p_{21}^{x_{21}}\, (1 - p_{11} - p_{12} - p_{21})^{x_{22}}, \qquad x_{ij} = 0, 1; \quad \sum x_{ij} = 1 \]

having three parameters. What are the maximum-likelihood estimates
of the parameters if $n_{11} = 90$, $n_{12} = 6$, $n_{21} = 3$, $n_{22} = 1$?

21. Referring to the above problem, suppose there is no reason to
believe that defective diameters can in any way be related to defective
lengths. Then the distribution of the $x_{ij}$ can be set up in terms of two
parameters: $p_1$, the probability of a satisfactory length, and $q_1$, the
probability of a satisfactory diameter. The density of the $x_{ij}$ is then

\[ (p_1 q_1)^{x_{11}}\, [p_1(1 - q_1)]^{x_{12}}\, [(1 - p_1) q_1]^{x_{21}}\, [(1 - p_1)(1 - q_1)]^{x_{22}}, \qquad x_{ij} = 0, 1; \quad \sum x_{ij} = 1 \]


What are the maximum-likelihood estimates for these parameters? 
Are the probabilities for the four classes different under this model 
from those obtained in the above problem? 

22. A sample of size $n_1$ is to be drawn from a normal population
with mean $\mu_1$ and variance $\sigma_1^2$. A second sample of size $n_2$ is to be
drawn from a normal population with mean $\mu_2$ and variance $\sigma_2^2$. What
is the maximum-likelihood estimator of $\alpha = \mu_1 - \mu_2$? Assuming the
total sample size $n = n_1 + n_2$ is fixed, how should the $n$ observations
be divided between the two populations in order to minimize the
variance of $\hat\alpha$?


23. Suppose intelligence quotients for students in a particular age 
group are normally distributed about a mean of 100 with standard 
deviation 15. The I.Q., say x, of a particular student is to be esti- 
mated by a test on which he scores 130. It is further given that test 
scores are normally distributed about the true I.Q. as a mean with 
standard deviation 5. What is the maximum-likelihood estimate of 
the student’s I.Q.? (The answer is not 130.) 

24. A sample of size $n$ is drawn from each of four normal populations,
all of which have the same variance $\sigma^2$. The means of the four populations
are $a + b + c$, $a + b - c$, $a - b + c$, $a - b - c$. What are the
maximum-likelihood estimators of $a$, $b$, $c$, and $\sigma^2$? (The sample
observations may be denoted by $x_{ij}$, $i = 1, 2, 3, 4$, and $j = 1, 2, \dots, n$.)

25. Observations $x_1, x_2, \dots, x_n$ are drawn from normal populations
with the same mean $\mu$ but with different variances $\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2$.
Is it possible to estimate all the parameters? Assuming the $\sigma_i^2$
are known, what is the maximum-likelihood estimator of $\mu$?

26. Is $\hat\sigma_1$, the square root of the expression on the right of equation
(5.2), an unbiased estimate of $\sigma$?


CHAPTER 9 
THE MULTIVARIATE NORMAL DISTRIBUTION 


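9.1. The Bivariate Normal Distribution. The bivariate normal distribution
is a generalization of the normal distribution for a single
variate. The density has the form

\[ f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right) + \left(\frac{y - \mu_y}{\sigma_y}\right)^2\right]\right\} \tag{1} \]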


and may be represented by a bell-shaped surface $z = f(x, y)$ as in
Fig. 41. Any plane parallel to the $x, y$ plane which cuts the surface
will intersect it in an elliptical curve, while any plane perpendicular
to the $x, y$ plane will cut the surface in a curve of the normal form.
The probability that a point $(x, y)$ drawn at random will lie in any
region $R$ of the $x, y$ plane is obtained by integrating the function over
that region,

\[ P[(x, y) \text{ is in } R] = \iint_R f(x, y)\,dy\,dx \tag{2} \]

[Fig. 41. The surface $z = f(x, y)$, drawn for $f > k$.]
R 


The function might, for example, represent the distribution of hits
on a vertical target (Chap. 4) where $x$ and $y$ represent the horizontal
and vertical deviations from the central lines. And in fact the distribution
closely approximates the distribution of this as well as many
other bivariate populations encountered in practice. 

We must first show that the function actually represents a distribu- 
tion by showing that its integral over the whole plane is one, i.e., 


\[ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1 \tag{3} \]


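The function will, of course, be positive if $-1 < \rho < 1$. To simplify
the integral, we shall substitute

\[ u = \frac{x - \mu_x}{\sigma_x}, \qquad v = \frac{y - \mu_y}{\sigma_y} \tag{4} \]

so that it becomes

\[ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2(1 - \rho^2)}(u^2 - 2\rho uv + v^2)}\,dv\,du \]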


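On completing the square on $u$ in the exponent, we have

\[ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2(1 - \rho^2)}\left[(u - \rho v)^2 + (1 - \rho^2)v^2\right]}\,dv\,du \]

and on substituting

\[ w = \frac{u - \rho v}{\sqrt{1 - \rho^2}}, \qquad dw = \frac{du}{\sqrt{1 - \rho^2}} \]

the integral may be written as the product of two simple integrals,

\[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\,dw \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-v^2/2}\,dv \tag{5} \]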


both of which are one, as we have seen in studying the univariate 
normal distribution. Equation (3) is thus verified. 
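
As a rough numerical check of the same fact, the density can be summed over a fine grid. This is only a sketch under assumed parameter values; the function names, the midpoint rule, and the truncation of the plane at eight standard deviations are choices made here, not part of the proof.

```python
import math


def bivariate_normal_density(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density in the form of equation (1)."""
    u = (x - mu_x) / sigma_x
    v = (y - mu_y) / sigma_y
    q = (u * u - 2 * rho * u * v + v * v) / (1 - rho * rho)
    norm = 2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - rho * rho)
    return math.exp(-0.5 * q) / norm


def integrate_over_plane(mu_x, mu_y, sigma_x, sigma_y, rho, half_width=8.0, steps=200):
    """Midpoint-rule integral over the means plus or minus half_width standard deviations."""
    hx = 2 * half_width * sigma_x / steps
    hy = 2 * half_width * sigma_y / steps
    total = 0.0
    for i in range(steps):
        x = mu_x - half_width * sigma_x + (i + 0.5) * hx
        for j in range(steps):
            y = mu_y - half_width * sigma_y + (j + 0.5) * hy
            total += bivariate_normal_density(x, y, mu_x, mu_y, sigma_x, sigma_y, rho)
    return total * hx * hy


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    print(integrate_over_plane(1.0, -2.0, 2.0, 3.0, 0.6))
```

The printed value should be very close to one for any admissible choice of the five parameters.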

To obtain the moments of x and y, we shall find their joint moment 
generating function, say, 


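\[ m(t_1, t_2) = E(e^{t_1 x + t_2 y}) \tag{6} \]

\[ = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{t_1 x + t_2 y} f(x, y)\,dy\,dx \tag{7} \]

Let us again substitute for $x$ and $y$ in terms of $u$ and $v$ to obtain

\[ m(t_1, t_2) = e^{t_1\mu_x + t_2\mu_y} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{t_1\sigma_x u + t_2\sigma_y v - \frac{1}{2(1 - \rho^2)}(u^2 - 2\rho uv + v^2)}\,dv\,du \tag{8} \]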


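The combined exponents in the integrand may be written

\[ -\frac{1}{2(1 - \rho^2)}\left[u^2 - 2\rho uv + v^2 - 2(1 - \rho^2)t_1\sigma_x u - 2(1 - \rho^2)t_2\sigma_y v\right] \]

and on completing the square first on $u$ and then on $v$, we find this
expression becomes

\[ -\frac{1}{2(1 - \rho^2)}\Big\{[u - \rho v - (1 - \rho^2)t_1\sigma_x]^2 + (1 - \rho^2)(v - \rho t_1\sigma_x - t_2\sigma_y)^2 - (1 - \rho^2)(t_1^2\sigma_x^2 + 2\rho t_1 t_2\sigma_x\sigma_y + t_2^2\sigma_y^2)\Big\} \]

which, on substituting

\[ w = \frac{u - \rho v - (1 - \rho^2)t_1\sigma_x}{\sqrt{1 - \rho^2}}, \qquad z = v - \rho t_1\sigma_x - t_2\sigma_y \]

becomes

\[ -\tfrac{1}{2}w^2 - \tfrac{1}{2}z^2 + \tfrac{1}{2}(t_1^2\sigma_x^2 + 2\rho t_1 t_2\sigma_x\sigma_y + t_2^2\sigma_y^2) \]

and the integral in (8) may be written

\[ m(t_1, t_2) = e^{t_1\mu_x + t_2\mu_y}\, e^{\frac{1}{2}(t_1^2\sigma_x^2 + 2\rho t_1 t_2\sigma_x\sigma_y + t_2^2\sigma_y^2)} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi}\, e^{-\frac{1}{2}w^2 - \frac{1}{2}z^2}\,dw\,dz \]

\[ = e^{t_1\mu_x + t_2\mu_y + \frac{1}{2}(t_1^2\sigma_x^2 + 2\rho t_1 t_2\sigma_x\sigma_y + t_2^2\sigma_y^2)} \tag{9} \]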

since the integral is obviously one. 
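The moments may be obtained by evaluating the derivatives of
$m(t_1, t_2)$ at $t_1 = 0$, $t_2 = 0$. Thus,

\[ E(x) = \left.\frac{\partial m}{\partial t_1}\right|_{t_1 = t_2 = 0} = \mu_x \tag{10} \]

\[ E(x^2) = \left.\frac{\partial^2 m}{\partial t_1^2}\right|_{t_1 = t_2 = 0} = \mu_x^2 + \sigma_x^2 \tag{11} \]

hence the variance of $x$ is

\[ E(x - \mu_x)^2 = E(x^2) - \mu_x^2 = \sigma_x^2 \tag{12} \]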
Similarly, on differentiating with respect to $t_2$, one finds the mean and
variance of $y$ to be $\mu_y$ and $\sigma_y^2$. We can also obtain joint moments

\[ E(x^r y^s) \]

by differentiating $m(t_1, t_2)$ $r$ times with respect to $t_1$ and $s$ times with
respect to $t_2$, then putting $t_1$ and $t_2$ equal to zero. The covariance of
$x$ and $y$ is

\[ E[(x - \mu_x)(y - \mu_y)] = E(xy - x\mu_y - y\mu_x + \mu_x\mu_y) = E(xy) - \mu_x\mu_y = \rho\sigma_x\sigma_y \tag{13} \]


as may be verified by differentiating $m(t_1, t_2)$ once with respect to each
variable, then putting the variables equal to zero. The parameter $\rho$
is called the correlation between $x$ and $y$. When the correlation is zero,
it will be observed in (1) that $f(x, y)$ becomes the product of two univariate
normal distributions; hence in this case ($\rho = 0$), $x$ and $y$ will
be independent in the probability sense.
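
The differentiation of the moment generating function can be imitated numerically with finite differences. The sketch below is illustrative only: it takes the closed form (9) as given, and the function names, step size, and parameter values are assumptions made here.

```python
import math


def mgf(t1, t2, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Joint moment generating function in the closed form of equation (9)."""
    quad = (t1 ** 2 * sigma_x ** 2
            + 2 * rho * t1 * t2 * sigma_x * sigma_y
            + t2 ** 2 * sigma_y ** 2)
    return math.exp(t1 * mu_x + t2 * mu_y + 0.5 * quad)


def moments_from_mgf(params, h=1e-3):
    """Approximate means, variances and covariance by central differences at (0, 0)."""
    def m(t1, t2):
        return mgf(t1, t2, *params)

    e_x = (m(h, 0) - m(-h, 0)) / (2 * h)
    e_y = (m(0, h) - m(0, -h)) / (2 * h)
    e_x2 = (m(h, 0) - 2 * m(0, 0) + m(-h, 0)) / h ** 2
    e_y2 = (m(0, h) - 2 * m(0, 0) + m(0, -h)) / h ** 2
    e_xy = (m(h, h) - m(h, -h) - m(-h, h) + m(-h, -h)) / (4 * h ** 2)
    return {
        "mean_x": e_x,
        "mean_y": e_y,
        "var_x": e_x2 - e_x ** 2,
        "var_y": e_y2 - e_y ** 2,
        "cov_xy": e_xy - e_x * e_y,
    }


if __name__ == "__main__":
    # Hypothetical values: mu_x = 1, mu_y = -2, sigma_x = 2, sigma_y = 3, rho = 0.5.
    print(moments_from_mgf((1.0, -2.0, 2.0, 3.0, 0.5)))
```

Up to the error of the finite differences, the output reproduces $\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$, and $\rho\sigma_x\sigma_y$.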

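The marginal density of one of the variables, $x$, for example, is by
definition

\[ f_1(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \tag{14} \]

and again substituting

\[ v = \frac{y - \mu_y}{\sigma_y} \]

and completing the square on $v$, one finds

\[ f_1(x) = \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma_x\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2}\left(\frac{x - \mu_x}{\sigma_x}\right)^2\right\} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[v - \rho\,\frac{x - \mu_x}{\sigma_x}\right]^2\right\} dv \]

Then the substitution

\[ w = \frac{v - \rho(x - \mu_x)/\sigma_x}{\sqrt{1 - \rho^2}}, \qquad dw = \frac{dv}{\sqrt{1 - \rho^2}} \]

shows at once that

\[ f_1(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-\frac{1}{2}\left(\frac{x - \mu_x}{\sigma_x}\right)^2} \tag{15} \]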

the univariate normal density. Similarly the marginal density of y 
may be found to be 


\[ f_2(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{1}{2}\left(\frac{y - \mu_y}{\sigma_y}\right)^2} \tag{16} \]

Having the marginal distributions, it is possible to determine the 
conditional distributions. Thus the conditional density of x for fixed 
values of y is 


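\[ f(x \mid y) = \frac{f(x, y)}{f_2(y)} \]

and after substituting for the functions on the right, the expression
may be put in the form

\[ f(x \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_x\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2\sigma_x^2(1 - \rho^2)}\left[x - \mu_x - \frac{\rho\sigma_x}{\sigma_y}(y - \mu_y)\right]^2\right\} \tag{17} \]

which is a univariate normal density with mean $\mu_x + (\rho\sigma_x/\sigma_y)(y - \mu_y)$
and with variance $\sigma_x^2(1 - \rho^2)$. The conditional distribution of $y$
may be obtained by interchanging $x$ and $y$ throughout (17) to get

\[ f(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma_y\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2\sigma_y^2(1 - \rho^2)}\left[y - \mu_y - \frac{\rho\sigma_y}{\sigma_x}(x - \mu_x)\right]^2\right\} \tag{18} \]
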
The mean value of a variate in a conditional distribution is called the
regression function when regarded as a function of the fixed variates
in the conditional distribution. Thus the regression function for $x$ in
(17) is $\mu_x + (\rho\sigma_x/\sigma_y)(y - \mu_y)$, which is a linear function of $y$ in the
present case. For bivariate distributions in general, the mean of $x$
in the conditional distribution of $x$ will be some function, say $g(y)$, and
the equation

\[ x = g(y) \]

when plotted in the $x, y$ plane gives the regression curve for $x$. It is
simply a curve which gives the location of the mean of $x$ for various
values of $y$.

For the bivariate normal distribution, the regression curve is the
straight line obtained by plotting

\[ x = \mu_x + \frac{\rho\sigma_x}{\sigma_y}(y - \mu_y) \tag{19} \]

as shown in Fig. 42. The conditional density of $x$, $f(x \mid y)$, is also
plotted in the figure for two particular values, $y_0$ and $y_1$, of $y$.

[Fig. 42. The regression line $x = \mu_x + (\rho\sigma_x/\sigma_y)(y - \mu_y)$.]

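
The conditional density and the regression function can be checked against the ratio of the joint density to the marginal density. The following sketch is an illustration with assumed names and parameter values, not a derivation: it computes the conditional mean and variance stated for (17) and compares the resulting normal density with $f(x, y)/f_2(y)$.

```python
import math


def normal_density(x, mean, variance):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)


def bivariate_normal_density(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density in the form of equation (1)."""
    u = (x - mu_x) / sigma_x
    v = (y - mu_y) / sigma_y
    q = (u * u - 2 * rho * u * v + v * v) / (1 - rho * rho)
    norm = 2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - rho * rho)
    return math.exp(-0.5 * q) / norm


def conditional_mean_and_variance(y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Regression function of x on y and the conditional variance of x."""
    mean = mu_x + rho * sigma_x / sigma_y * (y - mu_y)
    variance = sigma_x ** 2 * (1 - rho ** 2)
    return mean, variance


def conditional_density_by_ratio(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Conditional density of x given y computed as f(x, y) / f2(y)."""
    joint = bivariate_normal_density(x, y, mu_x, mu_y, sigma_x, sigma_y, rho)
    return joint / normal_density(y, mu_y, sigma_y ** 2)


if __name__ == "__main__":
    params = (1.0, -2.0, 2.0, 3.0, 0.5)  # hypothetical parameter values
    mean, variance = conditional_mean_and_variance(-1.0, *params)
    print(conditional_density_by_ratio(1.5, -1.0, *params),
          normal_density(1.5, mean, variance))
```

The two printed densities should agree to rounding error at every point $(x, y)$.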
The cumulative bivariate normal distribution

\[ F(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(s, t)\,dt\,ds \]

may be reduced to a form involving only the parameter $\rho$ by the
substitution (4). Thus,

\[ F_\rho(u, v) = \int_{-\infty}^{u}\int_{-\infty}^{v} \frac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{-[1/2(1 - \rho^2)](s^2 - 2\rho st + t^2)}\,dt\,ds \]

The function $F_\rho(u, v)$ is tabulated for $\rho = 0, .05, .10, \dots, .95$ in
Karl Pearson’s “Tables for Statisticians and Biometricians” (Part I,
Cambridge University Press, London, 1914).

9.2. Matrices and Determinants. It is apparent, from our study 
of the bivariate normal distribution, that an investigation of the 
$k$-variate normal distribution may involve some very unwieldy algebraic
expressions. In order to simplify such expressions, it is worth
while to develop briefly the algebra of matrices. 

A matrix is any rectangular array of quantities. For example, 


\[ \begin{Vmatrix} 3 & 0 & \log x \\ x^2 & a & f(y) \end{Vmatrix} \]


is a matrix with two rows and three columns. The matrix is nothing
more than the set of quantities; no operation on the quantities is
implied by writing them in such an array. The coordinates $(x, y)$
of a point in a plane may be regarded as a matrix $\|x, y\|$ with one row
and two columns. A sample of $n$ observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
from a bivariate population may be regarded as a matrix

\[ \begin{Vmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_n & y_n \end{Vmatrix} \]

with $n$ rows and two columns, or alternatively as a matrix

\[ \begin{Vmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{Vmatrix} \]

with two rows and n columns. The individual quantities which make 
up the matrix are called elements of the matrix. 

We shall be concerned with square matrices, which have the same
number of rows as columns. A general expression for a square matrix
is

\[ \begin{Vmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1k} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & a_{k3} & \cdots & a_{kk} \end{Vmatrix} \tag{1} \]

where the elements are represented by $a_{ij}$. The subscripts $i$ and $j$ give
the position of the element in the array. The first subscript designates
the row, and the second one the column. Thus the element represented
by $a_{57}$ lies in the fifth row and the seventh column. The top
row is generally taken to be the first row, and the left-hand column the
first column. The order of a square matrix is its number of rows or
columns; the matrix in (1) is of order $k$. The set of elements $a_{11}, a_{22}, a_{33}, \dots, a_{kk}$
are said to form the main diagonal of the matrix. A
square matrix is symmetric if $a_{ij} = a_{ji}$ for all $i$ and $j$, i.e., if the array is
unchanged when the rows and columns are interchanged. Thus,

\[ \begin{Vmatrix} a & 0 & x \\ 0 & b & y \\ x & y & c \end{Vmatrix} \]

is a symmetric square matrix of order three.

An algebra of matrices of the same order may be set up by defining
the operations of addition, subtraction, multiplication, and division.
The sum of two matrices is the matrix of the ordinary sums of corresponding
elements. Thus,

\[ \begin{Vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{Vmatrix} + \begin{Vmatrix} j & k & l \\ m & n & o \\ p & q & r \end{Vmatrix} = \begin{Vmatrix} a + j & b + k & c + l \\ d + m & e + n & f + o \\ g + p & h + q & i + r \end{Vmatrix} \tag{2} \]

Subtraction is similarly defined. The product of two matrices is
defined as follows: The element in the $i$th row and $j$th column of the
product matrix is obtained by multiplying the elements of the $i$th
row of the left-hand matrix by the corresponding elements of the $j$th
column of the right-hand matrix and adding the results. Thus, using
a dot to indicate multiplication,

\[ \begin{Vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{Vmatrix} \cdot \begin{Vmatrix} j & k & l \\ m & n & o \\ p & q & r \end{Vmatrix} = \begin{Vmatrix} aj + bm + cp & ak + bn + cq & al + bo + cr \\ dj + em + fp & dk + en + fq & dl + eo + fr \\ gj + hm + ip & gk + hn + iq & gl + ho + ir \end{Vmatrix} \tag{3} \]


It is to be observed that the product would be different were the order 
of the two matrices on the left reversed; multiplication is not com- 
mutative. Division will be defined later. 
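
The row-by-column rule and the failure of commutativity can be seen in a small sketch. The code is illustrative only; the function names and the two matrices of order two are choices made here.

```python
def mat_add(a, b):
    """Sum of two square matrices of the same order, element by element."""
    k = len(a)
    return [[a[i][j] + b[i][j] for j in range(k)] for i in range(k)]


def mat_mul(a, b):
    """Product of two square matrices: row i of `a` times column j of `b`."""
    k = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(k)) for j in range(k)]
            for i in range(k)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[0, 1], [1, 0]]
    print(mat_mul(a, b))  # columns of a interchanged
    print(mat_mul(b, a))  # rows of a interchanged
```

Multiplying by this particular matrix on the right interchanges the columns of the other matrix, while multiplying on the left interchanges its rows, so the two products differ.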

We shall use the symbol $\|a_{ij}\|$ to represent a general square matrix
of order $k$; i.e., $\|a_{ij}\|$ represents the array given in (1). In this notation,
the definitions of addition, subtraction, and multiplication are

\[ \|a_{ij}\| \pm \|b_{ij}\| = \|a_{ij} \pm b_{ij}\| \tag{4} \]

\[ \|a_{ij}\| \cdot \|b_{ij}\| = \Big\| \sum_{m=1}^{k} a_{im} b_{mj} \Big\| \tag{5} \]


The unit matrix is defined to be the matrix which has ones for the 
main diagonal elements and all other elements zero. Thus, 


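\[ \begin{Vmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{Vmatrix} \]

is the unit matrix of order three. We shall use the symbol $\delta_{ij}$ to represent
the elements of the unit matrix; thus $\delta_{ij}$ is defined by

\[ \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} \tag{6} \]

It is easily verified that

\[ \|\delta_{ij}\| \cdot \|a_{ij}\| = \|a_{ij}\| \cdot \|\delta_{ij}\| = \|a_{ij}\| \tag{7} \]
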
The unit matrix plays the same role in matrix algebra that unity does 
in ordinary algebra. 

Certain matrices have corresponding inverse matrices. The inverse
of a matrix $\|a_{ij}\|$ is a matrix, with elements which we shall denote by
$a^{ij}$, such that

\[ \|a^{ij}\| \cdot \|a_{ij}\| = \|\delta_{ij}\| \tag{8} \]

Thus the inverse of a matrix corresponds to the quantity $1/c$ associated
with a quantity $c$ in ordinary algebra. Division of matrices is defined
in terms of the inverse matrix of the denominator. Thus,

\[ \|b_{ij}\| \div \|a_{ij}\| \ \text{is defined to be}\ \|b_{ij}\| \cdot \|a^{ij}\| \tag{9} \]

The inverse of a matrix is often indicated by putting the exponent $-1$
on the matrix. Thus if a matrix $\|b_{ij}\|$ has an inverse matrix with
elements $b^{ij}$, that fact is usually indicated by writing

\[ \|b^{ij}\| = \|b_{ij}\|^{-1} \]

Since multiplication is not commutative in general, it follows that
$\|b_{ij}\|^{-1} \cdot \|a_{ij}\|$ will in general be different from $\|a_{ij}\| \cdot \|b_{ij}\|^{-1}$. However,
it can be shown that a matrix is commutative with its own
inverse:

\[ \|a_{ij}\|^{-1} \cdot \|a_{ij}\| = \|a_{ij}\| \cdot \|a_{ij}\|^{-1} = \|\delta_{ij}\| \tag{10} \]

Our principal problem in connection with matrices will be to find
the inverse of a given matrix. This is most easily done by means of
determinants.

The elements of a matrix may be used to form a determinant. We
may recall the properties of determinants that are of primary interest
here. A determinant is a particular function of a square array of
elements, $\|a_{ij}\|$, namely, the polynomial

\[ \sum \pm\, a_{1 i_1} a_{2 i_2} \cdots a_{k i_k} \tag{11} \]

where the sum over $i_1, i_2, \dots, i_k$ is taken over all permutations of
$1, 2, 3, \dots, k$, and where the sign is plus or minus according as the
permutation $(i_1, i_2, \dots, i_k)$ is an even or odd permutation of $1, 2, 3, \dots, k$
(i.e., according as the integers in $(i_1, i_2, \dots, i_k)$ must be
interchanged an even or odd number of times to bring them into the
order $1, 2, 3, \dots, k$). The function (11) is usually represented by
the array in (1) except that single vertical bars instead of double bars
are employed. We shall use the letter $A$ to represent the determinant
of the elements $a_{ij}$.

\[ A = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} \end{vmatrix} = \sum \pm\, a_{1 i_1} a_{2 i_2} \cdots a_{k i_k} \tag{12} \]

The cofactor of any element $a_{ij}$ is the determinant of order $k - 1$
formed by omitting the $i$th row and $j$th column of $A$ multiplied by
$(-1)^{i+j}$. We shall denote the cofactor of $a_{ij}$ by $A_{ij}$. Thus,

\[ A_{23} = (-1)^{5} \begin{vmatrix} a_{11} & a_{12} & a_{14} & a_{15} & \cdots & a_{1k} \\ a_{31} & a_{32} & a_{34} & a_{35} & \cdots & a_{3k} \\ a_{41} & a_{42} & a_{44} & a_{45} & \cdots & a_{4k} \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & a_{k4} & a_{k5} & \cdots & a_{kk} \end{vmatrix} \]


It is shown in the elementary theory of determinants that the value
of a determinant may be obtained by adding the products of the
elements of any row by their cofactors, i.e.,

\[ A = a_{i1}A_{i1} + a_{i2}A_{i2} + \cdots + a_{ik}A_{ik} = \sum_{j=1}^{k} a_{ij}A_{ij} \tag{13} \]

where any value of $i$ may be used. By means of this result, the
problem of finding the polynomial expansion (11) of a determinant is
reduced to the problem of expanding determinants $A_{ij}$ of one less order.
The determinants $A_{ij}$ may be further reduced to expressions involving
determinants of order $k - 2$, and so on. Thus, always expanding
on the first row, for example, the function represented by a determinant
of order three may be found as follows:

\[ \begin{aligned} \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} &= a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - b \begin{vmatrix} d & f \\ g & i \end{vmatrix} + c \begin{vmatrix} d & e \\ g & h \end{vmatrix} \\ &= a(e|i| - f|h|) - b(d|i| - f|g|) + c(d|h| - e|g|) \\ &= aei - afh - bdi + bfg + cdh - ceg \end{aligned} \]

since $|x| = x$ by (11).

One other property of determinants which we shall require is

\[ \sum_{j=1}^{k} a_{ij}A_{mj} = 0, \qquad i \neq m \tag{14} \]

If the elements of any row are multiplied by the cofactors of the cor- 
responding elements of any other row, the sum of the resulting products 
will vanish. 
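
Properties (13) and (14) lend themselves to a short recursive sketch. The code below is an illustration, not an efficient method; the function names and the test matrix are assumptions made here, and rows and columns are numbered from zero, which leaves the sign $(-1)^{i+j}$ unchanged.

```python
def minor(matrix, i, j):
    """Array left after omitting row i and column j (zero-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(matrix) if r != i]


def determinant(matrix):
    """Determinant by expansion on the first row, as in (13)."""
    if len(matrix) == 1:
        return matrix[0][0]
    return sum(matrix[0][j] * cofactor(matrix, 0, j) for j in range(len(matrix)))


def cofactor(matrix, i, j):
    """Signed determinant of the minor of element (i, j)."""
    return (-1) ** (i + j) * determinant(minor(matrix, i, j))


if __name__ == "__main__":
    m = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]  # hypothetical matrix
    k = len(m)
    for i in range(k):
        # Row i against its own cofactors gives the determinant;
        # row i against the cofactors of another row gives zero.
        print([sum(m[i][j] * cofactor(m, r, j) for j in range(k)) for r in range(k)])
```

Each printed row shows the determinant in the diagonal position and zeros elsewhere, which is the pattern used in finding the inverse.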

We can now determine the inverse of a given matrix in terms of its
elements. Suppose the determinant of $\|a_{ij}\|$ is not zero. We shall
show that the elements $a^{ij}$ of the inverse of $\|a_{ij}\|$ are

\[ a^{ij} = \frac{A_{ji}}{A} \tag{15} \]

where $A$ is the determinant of $\|a_{ij}\|$ and $A_{ji}$ is the cofactor of $a_{ji}$. To
do this, we need only show that

\[ \|a_{ij}\| \cdot \|a^{ij}\| = \|\delta_{ij}\| \]

By definition of a product, the element $c_{ij}$, say, in the product is

\[ c_{ij} = \sum_{m} a_{im} a^{mj} = \frac{1}{A} \sum_{m} a_{im} A_{jm} \]

From (13) it follows that the sum $\sum_m a_{im}A_{jm}$ is equal to $A$ when $i = j$, and from
(14) the sum is zero when $i \neq j$. Hence we have at once that $c_{ij} = \delta_{ij}$.

If the determinant of a matrix vanishes, it is impossible to define 
its inverse, and division by such matrices is not possible. This situa- 
tion is not entirely analogous to division by zero in ordinary algebra 
because there are many matrices with vanishing determinants whereas 
there is only one quantity zero in ordinary algebra. 

Two properties of inverse matrices which we shall require later and 
which we state without proof are: (1) The determinant of the 
inverse of a matrix is equal to the reciprocal of the determinant of the 
original matrix. (2) If a matrix is symmetric, its inverse will also be 
symmetric. 

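To illustrate the computation of an inverse matrix, we shall find the
inverse of

\[ \|a_{ij}\| = \begin{Vmatrix} 3 & 1 & 0 \\ 2 & 4 & 1 \\ 0 & 1 & 3 \end{Vmatrix} \]

The determinant of the matrix is

\[ |a_{ij}| = \begin{vmatrix} 3 & 1 & 0 \\ 2 & 4 & 1 \\ 0 & 1 & 3 \end{vmatrix} = 3 \begin{vmatrix} 4 & 1 \\ 1 & 3 \end{vmatrix} - 1 \begin{vmatrix} 2 & 1 \\ 0 & 3 \end{vmatrix} + 0 \begin{vmatrix} 2 & 4 \\ 0 & 1 \end{vmatrix} = 3(12 - 1) - (6 - 0) = 27 \]

The cofactors of the elements are

\[ A_{11} = \begin{vmatrix} 4 & 1 \\ 1 & 3 \end{vmatrix} = 11, \qquad A_{12} = -\begin{vmatrix} 2 & 1 \\ 0 & 3 \end{vmatrix} = -6, \qquad A_{13} = \begin{vmatrix} 2 & 4 \\ 0 & 1 \end{vmatrix} = 2, \qquad A_{21} = -\begin{vmatrix} 1 & 0 \\ 1 & 3 \end{vmatrix} = -3 \]

and so on; the complete matrix of cofactors is

\[ \|A_{ij}\| = \begin{Vmatrix} 11 & -6 & 2 \\ -3 & 9 & -3 \\ 1 & -3 & 10 \end{Vmatrix} \]

On dividing each element of this matrix by $|a_{ij}| = 27$ and interchanging
rows and columns, we have the inverse

\[ \|a^{ij}\| = \begin{Vmatrix} 11/27 & -3/27 & 1/27 \\ -6/27 & 9/27 & -3/27 \\ 2/27 & -3/27 & 10/27 \end{Vmatrix} \]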

as may be verified by multiplying this matrix by the original matrix 
to obtain the unit matrix of order three. 
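
The computation can be repeated mechanically with exact fractions. The sketch below is an illustration only: the function names are assumptions made here, and the example matrix is taken to have rows (3, 1, 0), (2, 4, 1), (0, 1, 3), which is the reading consistent with the determinant 27 and the cofactors quoted in the text.

```python
from fractions import Fraction


def minor(matrix, i, j):
    """Array left after omitting row i and column j (zero-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(matrix) if r != i]


def determinant(matrix):
    """Determinant by expansion on the first row."""
    if len(matrix) == 1:
        return matrix[0][0]
    return sum(matrix[0][j] * cofactor(matrix, 0, j) for j in range(len(matrix)))


def cofactor(matrix, i, j):
    """Signed determinant of the minor of element (i, j)."""
    return (-1) ** (i + j) * determinant(minor(matrix, i, j))


def inverse(matrix):
    """Inverse by rule (15): element (i, j) is the cofactor A_ji over the determinant."""
    det = determinant(matrix)
    if det == 0:
        raise ValueError("the determinant vanishes, so no inverse exists")
    k = len(matrix)
    return [[Fraction(cofactor(matrix, j, i), det) for j in range(k)] for i in range(k)]


if __name__ == "__main__":
    a = [[3, 1, 0], [2, 4, 1], [0, 1, 3]]
    for row in inverse(a):
        print([str(element) for element in row])
```

The fractions are printed in lowest terms, so $-3/27$ appears as $-1/9$ and $9/27$ as $1/3$.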

9.3. The Bivariate Normal Distribution in Matrix Notation. We
shall denote the two variates by $x_1$ and $x_2$ instead of $x$ and $y$, and their
means by $\xi_1$ and $\xi_2$ in place of $\mu_x$ and $\mu_y$. (To use $\mu_1$ and $\mu_2$ for the
means might result in some confusion with the moments about the
mean for a single variate.) The variances of $x_1$ and $x_2$ will be denoted
by $\sigma_{11}$ and $\sigma_{22}$ instead of $\sigma_x^2$ and $\sigma_y^2$. Instead of the correlation $\rho$, we
shall use the covariance $\rho\sigma_x\sigma_y$ as the fifth parameter and denote it by
$\sigma_{12}$ or $\sigma_{21}$. Both $\sigma_{12}$ and $\sigma_{21}$ will be used, but it is to be remembered
that they are equal and represent the same parameter. The matrix


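\[ \|\sigma_{ij}\| = \begin{Vmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{Vmatrix} \tag{1} \]

will be referred to as the variance-covariance matrix or, more briefly,
as the covariance matrix. It is a symmetric matrix. The determinant
of the matrix is

\[ |\sigma_{ij}| = \sigma_{11}\sigma_{22} - \sigma_{12}^2 \tag{2} \]

which in the earlier notation is

\[ |\sigma_{ij}| = \sigma_x^2\sigma_y^2(1 - \rho^2) \tag{3} \]

The inverse of the matrix is

\[ \|\sigma^{ij}\| = \begin{Vmatrix} \dfrac{\sigma_{22}}{|\sigma_{ij}|} & \dfrac{-\sigma_{12}}{|\sigma_{ij}|} \\[2ex] \dfrac{-\sigma_{21}}{|\sigma_{ij}|} & \dfrac{\sigma_{11}}{|\sigma_{ij}|} \end{Vmatrix} \tag{4} \]

which is symmetric since $\sigma_{12} = \sigma_{21}$. In the earlier notation the elements
of the inverse are

\[ \|\sigma^{ij}\| = \begin{Vmatrix} \dfrac{1}{\sigma_x^2(1 - \rho^2)} & \dfrac{-\rho}{\sigma_x\sigma_y(1 - \rho^2)} \\[2ex] \dfrac{-\rho}{\sigma_x\sigma_y(1 - \rho^2)} & \dfrac{1}{\sigma_y^2(1 - \rho^2)} \end{Vmatrix} \tag{5} \]

The determinant of the inverse is

\[ |\sigma^{ij}| = \frac{1}{|\sigma_{ij}|} = \frac{1}{\sigma_x^2\sigma_y^2(1 - \rho^2)} \]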

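Now it is to be observed in (5) that the numbers $\sigma^{ij}$ are essentially
the coefficients of the terms in the exponent in equation (1.1). In
fact, the exponent may be written as:

\[ -\tfrac{1}{2}\left[\sigma^{11}(x_1 - \xi_1)^2 + \sigma^{12}(x_1 - \xi_1)(x_2 - \xi_2) + \sigma^{21}(x_2 - \xi_2)(x_1 - \xi_1) + \sigma^{22}(x_2 - \xi_2)^2\right] \]

and the constant multiplier in the distribution may be written as

\[ \frac{\sqrt{|\sigma^{ij}|}}{2\pi} \quad \text{or} \quad \frac{1}{2\pi\sqrt{|\sigma_{ij}|}} \]

The bivariate density may thus be put in the form

\[ f(x_1, x_2) = \frac{\sqrt{|\sigma^{ij}|}}{2\pi}\, e^{-\frac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{2} \sigma^{ij}(x_i - \xi_i)(x_j - \xi_j)} \]

The double sum in the exponent is called a quadratic form in the variables
$x_i - \xi_i$, the $\sigma^{ij}$ are called the coefficients of the quadratic form,
and $\|\sigma^{ij}\|$ is called the matrix of the quadratic form.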
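
The agreement between the matrix form and the earlier form (1.1) of the density can be checked numerically. The following sketch uses assumed function names and parameter values; it builds the covariance matrix from $\sigma_x$, $\sigma_y$, $\rho$, inverts it by the rule for a matrix of order two, and compares the two expressions for the density.

```python
import math


def covariance_matrix(sigma_x, sigma_y, rho):
    """Matrix of variances and covariances for the bivariate case."""
    cov = rho * sigma_x * sigma_y
    return [[sigma_x ** 2, cov], [cov, sigma_y ** 2]]


def inverse_2x2(m):
    """Return (inverse, determinant) of a matrix of order two."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    inv = [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]
    return inv, det


def density_matrix_form(x, xi, cov):
    """Density written with the quadratic form in the inverse covariance matrix."""
    inv, det = inverse_2x2(cov)
    d = [x[0] - xi[0], x[1] - xi[1]]
    q = sum(inv[i][j] * d[i] * d[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))


def density_scalar_form(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Density written with sigma_x, sigma_y and rho, as in equation (1.1)."""
    u = (x - mu_x) / sigma_x
    v = (y - mu_y) / sigma_y
    q = (u * u - 2 * rho * u * v + v * v) / (1 - rho * rho)
    return math.exp(-0.5 * q) / (2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - rho * rho))


if __name__ == "__main__":
    cov = covariance_matrix(2.0, 3.0, 0.5)  # hypothetical parameters
    print(density_matrix_form((1.5, -1.0), (1.0, -2.0), cov),
          density_scalar_form(1.5, -1.0, 1.0, -2.0, 2.0, 3.0, 0.5))
```

The two printed values should agree to rounding error.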

9.4. The Multivariate Normal Distribution. The multivariate
normal distribution may be thought of as the distribution of a population
of objects or events which may be characterized by several variables,
say $x_1, x_2, \dots, x_k$. Thus a population of human beings may
be characterized by their heights ($x_1$), weights ($x_2$), head lengths ($x_3$),
arm lengths ($x_4$), waist measurements ($x_5$), and so on. A machine
tool may produce steel shapes which may be specified by several
measurements of lengths and angles. Each member of the population
has a set of measurements $(x_1, x_2, \dots, x_k)$; a sample of size $n$ drawn
from such a population would consist of $n$ such sets of measurements.

Geometric language is often used to describe a multivariate population.
A given set of measurements $(x_1, x_2, \dots, x_k)$ is referred to
as the set of coordinates of a point in a $k$-dimensional space. The
population consists of the points of the space. The distribution could
be plotted in a $(k + 1)$-dimensional space, and would plot as a so-called
hypersurface consisting of the points $[x_1, x_2, \dots, x_k, f(x_1, x_2, \dots, x_k)]$.
The statements are the immediate generalizations of the case
of one- and two-variate populations. A distribution of a single variate
$x$, say $f(x)$, may be plotted in a two-dimensional space and consists
of the points $[x, f(x)]$ which lie on the curve $y = f(x)$. A distribution
of two variates $x$ and $y$ may be plotted as a surface in a three-dimensional
space; the points of the surface $z = f(x, y)$ have coordinates
$(x, y, f(x, y))$.


The multivariate normal density is

$$f(x_1, x_2, \cdots, x_k) = \left(\frac{1}{2\pi}\right)^{k/2}\sqrt{|\sigma^{ij}|}\; e^{-\frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\sigma^{ij}(x_i-\xi_i)(x_j-\xi_j)} \tag{1}$$

in which the matrix $\|\sigma^{ij}\|$ of the quadratic form is symmetric and has a
positive determinant. This is the direct generalization of the
distribution given at the end of the preceding section. We shall see later
that the inverse of the matrix $\|\sigma^{ij}\|$ of the quadratic form is the matrix
of variances and covariances, and that the means of the $x_i$ are $\xi_i$.
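In order to show that

$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1, x_2, \cdots, x_k)\prod_{i=1}^{k} dx_i = 1$$

we shall integrate out one of the variables, say $x_1$, by completing the
square on that variable. First we shall change the variables to

$$y_i = x_i - \xi_i \tag{2}$$

to shorten the ensuing expressions. The quadratic form becomes
$\sum\sum\sigma^{ij}y_iy_j$. Completing the square on $y_1$, we find

$$\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{k}\sigma^{ij}y_iy_j
&= \sigma^{11}y_1^2 + \sum_{j=2}^{k}\sigma^{1j}y_1y_j + \sum_{i=2}^{k}\sigma^{i1}y_iy_1 + \sum_{i=2}^{k}\sum_{j=2}^{k}\sigma^{ij}y_iy_j \\
&= \sigma^{11}\left(y_1^2 + 2y_1\sum_{j=2}^{k}\frac{\sigma^{1j}}{\sigma^{11}}y_j\right) + \sum_{i=2}^{k}\sum_{j=2}^{k}\sigma^{ij}y_iy_j \\
&= \sigma^{11}\left(y_1 + \sum_{j=2}^{k}\frac{\sigma^{1j}}{\sigma^{11}}y_j\right)^2 + \sum_{i=2}^{k}\sum_{j=2}^{k}\left(\sigma^{ij} - \frac{\sigma^{1i}\sigma^{1j}}{\sigma^{11}}\right)y_iy_j
\end{aligned} \tag{3}$$

and on substituting

$$u = y_1 + \sum_{j=2}^{k}\frac{\sigma^{1j}}{\sigma^{11}}y_j \tag{4}$$

$$\bar\sigma^{ij} = \sigma^{ij} - \frac{\sigma^{1i}\sigma^{1j}}{\sigma^{11}} \qquad i, j = 2, 3, \cdots, k \tag{5}$$

we have

$$\sum_{i=1}^{k}\sum_{j=1}^{k}\sigma^{ij}y_iy_j = \sigma^{11}u^2 + \sum_{i=2}^{k}\sum_{j=2}^{k}\bar\sigma^{ij}y_iy_j$$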

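With this reduction we can integrate out $y_1$. The integral on $y_1$ is

$$\int_{-\infty}^{\infty} f(y_1, y_2, \cdots, y_k)\,dy_1 = \int_{-\infty}^{\infty}\left(\frac{1}{2\pi}\right)^{k/2}\sqrt{|\sigma^{ij}|}\; e^{-\frac{1}{2}\sum_{1}^{k}\sum_{1}^{k}\sigma^{ij}y_iy_j}\,dy_1 \tag{6}$$

$$= \left(\frac{1}{2\pi}\right)^{k/2}\sqrt{|\sigma^{ij}|}\; e^{-\frac{1}{2}\sum_{2}^{k}\sum_{2}^{k}\bar\sigma^{ij}y_iy_j}\int_{-\infty}^{\infty} e^{-\frac{1}{2}\sigma^{11}u^2}\,du$$

$$= \left(\frac{1}{2\pi}\right)^{(k-1)/2}\frac{\sqrt{|\sigma^{ij}|}}{\sqrt{\sigma^{11}}}\; e^{-\frac{1}{2}\sum_{2}^{k}\sum_{2}^{k}\bar\sigma^{ij}y_iy_j}\int_{-\infty}^{\infty}\frac{\sqrt{\sigma^{11}}}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\sigma^{11}u^2}\,du \tag{7}$$

in which the integral is one, as follows from the univariate normal
distribution. Now let us examine the resulting function of $y_2, \cdots, y_k$,
say,

$$g(y_2, y_3, \cdots, y_k) = \left(\frac{1}{2\pi}\right)^{(k-1)/2}\frac{\sqrt{|\sigma^{ij}|}}{\sqrt{\sigma^{11}}}\; e^{-\frac{1}{2}\sum_{i'}\sum_{j'}\bar\sigma^{i'j'}y_{i'}y_{j'}} \tag{8}$$

where $i'$ and $j'$ are indices which run from 2 to $k$. Suppose we denote
the inverse of $\|\sigma^{ij}\|$ by $\|\sigma_{ij}\|$; then

$$\sum_{m=1}^{k}\sigma^{im}\sigma_{mj} = \delta_{ij} \tag{9}$$

and since $\|\sigma^{ij}\|$ is symmetric so is $\|\sigma_{ij}\|$, and we may interchange $i$ and
$m$ in $\sigma^{im}$ or $j$ and $m$ in $\sigma_{mj}$ without invalidating the relation (9).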

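We shall show that the inverse of $\|\bar\sigma^{i'j'}\|$, $i', j' = 2, 3, \cdots, k$, is
precisely $\|\sigma_{i'j'}\|$, $i', j' = 2, 3, \cdots, k$, i.e., on omitting the first row and
column of the inverse of $\|\sigma^{ij}\|$, we have the inverse of $\|\bar\sigma^{i'j'}\|$. We need
only show that

$$\sum_{m=2}^{k}\sigma_{i'm}\bar\sigma^{mj'} = \delta_{i'j'} \qquad i', j' = 2, 3, \cdots, k \tag{10}$$

Referring to (5),

$$\sum_{m=2}^{k}\sigma_{i'm}\bar\sigma^{mj'} = \sum_{m=2}^{k}\sigma_{i'm}\left(\sigma^{mj'} - \frac{\sigma^{1m}\sigma^{1j'}}{\sigma^{11}}\right) = \sum_{m=2}^{k}\sigma_{i'm}\sigma^{mj'} - \frac{\sigma^{1j'}}{\sigma^{11}}\sum_{m=2}^{k}\sigma_{i'm}\sigma^{1m} \tag{11}$$

and in view of (9), the first sum on the right of (11) is $\delta_{i'j'} - \sigma_{i'1}\sigma^{1j'}$,
while the second sum is $\delta_{i'1} - \sigma_{i'1}\sigma^{11} = -\sigma_{i'1}\sigma^{11}$ since $i'$ has the range
$2, 3, \cdots, k$ so that $\delta_{i'1} = 0$. The expression (11) is therefore

$$\delta_{i'j'} - \sigma_{i'1}\sigma^{1j'} - \frac{\sigma^{1j'}}{\sigma^{11}}\left(-\sigma_{i'1}\sigma^{11}\right) = \delta_{i'j'}$$

so that (10) is verified.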

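The coefficient $\sqrt{|\sigma^{ij}|}/\sqrt{\sigma^{11}}$ is $\sqrt{|\bar\sigma^{i'j'}|}$, as may be seen as follows: $\sigma^{11}$
is the cofactor of $\sigma_{11}$ in $|\sigma_{ij}|$ ($i, j = 1, 2, \cdots, k$) divided by $|\sigma_{ij}|$. The
cofactor is $|\sigma_{i'j'}|$ ($i', j' = 2, 3, \cdots, k$). Since $|\sigma^{ij}| = 1/|\sigma_{ij}|$, we have

$$\frac{\sqrt{|\sigma^{ij}|}}{\sqrt{\sigma^{11}}} = \frac{1}{\sqrt{|\sigma_{i'j'}|}}$$

and since $\|\sigma_{i'j'}\|$ is the inverse of $\|\bar\sigma^{i'j'}\|$, their determinants are
reciprocals and hence

$$\frac{\sqrt{|\sigma^{ij}|}}{\sqrt{\sigma^{11}}} = \sqrt{|\bar\sigma^{i'j'}|} \tag{12}$$

We find then that (8) is

$$g(y_2, y_3, \cdots, y_k) = \left(\frac{1}{2\pi}\right)^{(k-1)/2}\sqrt{|\bar\sigma^{i'j'}|}\; e^{-\frac{1}{2}\sum_{i'}\sum_{j'}\bar\sigma^{i'j'}y_{i'}y_{j'}} \tag{13}$$

Now suppose $y_2$ is integrated out of (13). The preceding argument
shows that the result will be, say,

$$h(y_3, y_4, \cdots, y_k) = \left(\frac{1}{2\pi}\right)^{(k-2)/2}\sqrt{|\tilde\sigma^{ij}|}\; e^{-\frac{1}{2}\sum_{i=3}^{k}\sum_{j=3}^{k}\tilde\sigma^{ij}y_iy_j} \tag{14}$$

where $\|\tilde\sigma^{ij}\|$ is the inverse of the matrix obtained by striking out the
first row and column of $\|\sigma_{i'j'}\|$ or by striking out the first two rows and
columns of $\|\sigma_{ij}\|$. Proceeding in this manner suppose all variables
but $y_k$ have been integrated out; the result will be, say,

$$p(y_k) = \frac{1}{\sqrt{2\pi}}\sqrt{\sigma^{0}}\; e^{-\frac{1}{2}\sigma^{0}y_k^2} \tag{15}$$

and we know what $\sigma^{0}$ is in terms of the original parameters $\sigma_{ij}$. $\sigma^{0}$ is
the inverse of the matrix obtained by striking out the first $k - 1$ rows
and columns of $\|\sigma_{ij}\|$, but this leaves only one element $\sigma_{kk}$ in the matrix,
and its inverse is simply $1/\sigma_{kk}$. Thus $\sigma^{0} = 1/\sigma_{kk}$. The integral of
(15) from $-\infty$ to $+\infty$ is, of course, one, and we have shown that (1)
does represent a density function.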
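
The key step of the argument, relations (5) and (10), can be checked
numerically. The following is only a sketch under assumptions: the
covariance matrix `cov` is a hypothetical example, and the helper
`inverse` is an ordinary Gauss-Jordan routine written for this
illustration, not something taken from the text. Exact fractions are
used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction

def inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(m)
    # Augment the matrix with the identity.
    a = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        # Bring a row with a nonzero pivot into position.
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]

# Hypothetical covariance matrix ||sigma_ij|| (symmetric, positive definite).
cov = [[4, 2, 1],
       [2, 3, 1],
       [1, 1, 2]]
prec = inverse(cov)  # ||sigma^{ij}||, the matrix of the quadratic form

# Equation (5): coefficients that remain after y_1 is integrated out.
k = len(cov)
reduced = [[prec[i][j] - prec[0][i] * prec[0][j] / prec[0][0]
            for j in range(1, k)] for i in range(1, k)]

# ||sigma_ij|| with its first row and column struck out.
struck = [[Fraction(cov[i][j]) for j in range(1, k)] for i in range(1, k)]

# Relation (10): the two matrices are inverses of one another.
print(inverse(reduced) == struck)
```

The same loop could be repeated on `reduced` to follow the successive
integrations that lead to (14) and (15).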

9.5. Marginal and Conditional Distributions. The argument in the
preceding section has supplied us, incidentally, with all the marginal
distributions associated with the multivariate normal distribution.
The marginal density for the first $r$ variates, $x_1, x_2, \cdots, x_r$, is obtained
by integrating out the remaining $k - r$ variates, and the result may be
put in the form

$$g(x_1, x_2, \cdots, x_r) = \left(\frac{1}{2\pi}\right)^{r/2}\sqrt{|\bar\sigma^{ab}|}\; e^{-\frac{1}{2}\sum_{a}\sum_{b}\bar\sigma^{ab}(x_a-\xi_a)(x_b-\xi_b)} \tag{1}$$

where the indices $a$ and $b$ take on the values $1, 2, \cdots, r$. The
coefficients $\bar\sigma^{ab}$ of the quadratic form are obtained by striking out the last
$k - r$ rows and columns of $\|\sigma_{ij}\|$ and inverting the result; i.e.,

$$\|\bar\sigma^{ab}\| = \|\sigma_{ab}\|^{-1} \qquad a, b = 1, 2, \cdots, r \tag{2}$$

If one wishes to obtain the marginal distribution of any other subset of
$r$ variates, he may merely relabel those variates $x_1, x_2, \cdots, x_r$ and
use the above form; or he may define indices $a'$, $b'$ which take on the
desired values. Thus if one wanted the marginal density of $x_1, x_4, x_5$,
he could put it in the form

$$\left(\frac{1}{2\pi}\right)^{3/2}\sqrt{|\bar\sigma^{a'b'}|}\; e^{-\frac{1}{2}\sum_{a'}\sum_{b'}\bar\sigma^{a'b'}(x_{a'}-\xi_{a'})(x_{b'}-\xi_{b'})} \qquad a', b' = 1, 4, 5$$

where

$$\|\bar\sigma^{a'b'}\| = \begin{Vmatrix} \sigma_{11} & \sigma_{14} & \sigma_{15} \\ \sigma_{41} & \sigma_{44} & \sigma_{45} \\ \sigma_{51} & \sigma_{54} & \sigma_{55} \end{Vmatrix}^{-1}$$

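Now let us turn to the conditional distributions. The conditional
density for the first $r$ variates, for example, is defined by

$$f(x_1, x_2, \cdots, x_r \mid x_{r+1}, \cdots, x_k) = \frac{f(x_1, x_2, \cdots, x_k)}{g(x_{r+1}, \cdots, x_k)} \tag{3}$$

where $g(x_{r+1}, \cdots, x_k)$ is the marginal density of the last $k - r$
variates and is

$$g(x_{r+1}, \cdots, x_k) = \left(\frac{1}{2\pi}\right)^{(k-r)/2}\sqrt{|\bar\sigma^{pq}|}\; e^{-\frac{1}{2}\sum_{p}\sum_{q}\bar\sigma^{pq}(x_p-\xi_p)(x_q-\xi_q)} \tag{4}$$

where $p, q = r + 1, r + 2, \cdots, k$, and

$$\|\bar\sigma^{pq}\| = \|\sigma_{pq}\|^{-1} \tag{5}$$

On dividing (4.1) by (4), (3) becomes:

$$f(y_1, \cdots, y_r \mid y_{r+1}, \cdots, y_k) = \left(\frac{1}{2\pi}\right)^{r/2}\frac{\sqrt{|\sigma^{ij}|}}{\sqrt{|\bar\sigma^{pq}|}}\; e^{-\frac{1}{2}\left(\sum\sum\sigma^{ij}y_iy_j - \sum\sum\bar\sigma^{pq}y_py_q\right)} \tag{6}$$

in which we have let $y_i = x_i - \xi_i$. We shall let $i, j = 1, 2, \cdots, k$;
$a, b = 1, 2, \cdots, r$; and $p, q = r + 1, r + 2, \cdots, k$ throughout
the remainder of this section. The conditional density (6) is a density
for the $y_a$; the $y_p$ are constants. We shall show that (6) is a
multivariate normal density for the $y_a$ and that the regression functions
(means of the $y_a$) are linear functions of the $y_p$.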

The quadratic form, $\sum\sum\sigma^{ij}y_iy_j$, may be put in the form

$$\sum_{a,b}\sigma^{ab}y_ay_b + 2\sum_{a,p}\sigma^{ap}y_ay_p + \sum_{p,q}\sigma^{pq}y_py_q \tag{7}$$

where the first sum involves the squares and products of the variates
$y_a$, the second sum involves only the first powers of the variates, and
the third does not involve the variates at all. First we eliminate the
linear terms by substituting

$$z_a = y_a + c_a \tag{8}$$

and properly choosing values of the $c_a$. The substitution changes (7)
to

$$\begin{aligned}
&\sum_{a,b}\sigma^{ab}(z_a - c_a)(z_b - c_b) + 2\sum_{a,p}\sigma^{ap}(z_a - c_a)y_p + \sum_{p,q}\sigma^{pq}y_py_q \\
&\quad= \sum_{a,b}\sigma^{ab}z_az_b - 2\sum_{a,b}\sigma^{ab}z_ac_b + \sum_{a,b}\sigma^{ab}c_ac_b + 2\sum_{a,p}\sigma^{ap}z_ay_p - 2\sum_{a,p}\sigma^{ap}c_ay_p + \sum_{p,q}\sigma^{pq}y_py_q
\end{aligned} \tag{9}$$

The second and fourth sums on the right of (9) will cancel if we put

$$\sum_{b}\sigma^{ab}c_b = \sum_{p}\sigma^{ap}y_p \tag{10}$$

This is a set of $r$ linear equations (for $a = 1, 2, \cdots, r$) which will
determine the $c$'s. We may solve them for the $c$'s easily by employing
the inverse of $\|\sigma^{ab}\|$, which we may denote by $\|\bar\sigma_{ab}\|$. On multiplying
(10) by $\bar\sigma_{aa'}$ and summing on $a$, we find

$$\sum_{a,p}\bar\sigma_{aa'}\sigma^{ap}y_p = \sum_{a,b}\bar\sigma_{aa'}\sigma^{ab}c_b = \sum_{b}\delta_{a'b}c_b = c_{a'} \tag{11}$$

If we define

$$a_{ap} = \sum_{b}\bar\sigma_{ab}\sigma^{bp} \tag{12}$$

then the $c$'s are the following linear functions of the $y_p$:

$$c_a = \sum_{p}a_{ap}y_p \tag{13}$$

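With the substitution of (8) and (11) in (6) the part of the exponent
in parentheses becomes then

$$\sum_{a,b}\sigma^{ab}z_az_b + \sum_{a,b}\sigma^{ab}c_ac_b - 2\sum_{a,p}\sigma^{ap}c_ay_p + \sum_{p,q}\sigma^{pq}y_py_q - \sum_{p,q}\bar\sigma^{pq}y_py_q \tag{14}$$

We shall show that the last four sums cancel out. If we substitute
for the $c$'s in (14) from (13), the coefficient of $y_py_q$ in the last four sums
of (14) is, say,

$$d_{pq} = \sum_{a,b}\sigma^{ab}a_{ap}a_{bq} - 2\sum_{a}\sigma^{aq}a_{ap} + \sigma^{pq} - \bar\sigma^{pq} \tag{15}$$

In the first sum $a$ and $b$ are interchanged and $\sum_{a'}\bar\sigma_{ba'}\sigma^{a'p}$ substituted for
$a_{bp}$ in accordance with (12). The first sum on the right of (15)
becomes

$$\sum_{a,b,a'}\sigma^{ab}\bar\sigma_{ba'}\sigma^{a'p}a_{aq} = \sum_{a,a'}\delta_{aa'}\sigma^{a'p}a_{aq} = \sum_{a}\sigma^{ap}a_{aq} \tag{16}$$

and thus cancels half the second term of (15), leaving

$$d_{pq} = -\sum_{a}\sigma^{ap}a_{aq} + \sigma^{pq} - \bar\sigma^{pq} \tag{17}$$

This expression is now multiplied by $\sigma_{pp'}$ and summed on $p$ after first
substituting for $a_{aq}$ from (12); we find

$$\sum_{p}\sigma_{pp'}d_{pq} = -\sum_{a,b,p}\sigma_{pp'}\sigma^{ap}\bar\sigma_{ab}\sigma^{bq} + \sum_{p}\sigma_{pp'}\sigma^{pq} - \sum_{p}\sigma_{pp'}\bar\sigma^{pq} \tag{18}$$

$$= -\sum_{a,b}\left(\delta_{ap'} - \sum_{a'}\sigma_{a'p'}\sigma^{aa'}\right)\bar\sigma_{ab}\sigma^{bq} + \sum_{p}\sigma_{pp'}\sigma^{pq} - \delta_{p'q} \tag{19}$$

$$\begin{aligned}
&= \sum_{a,a',b}\sigma_{a'p'}\sigma^{aa'}\bar\sigma_{ab}\sigma^{bq} + \sum_{p}\sigma_{pp'}\sigma^{pq} - \delta_{p'q} \\
&= \sum_{a',b}\sigma_{a'p'}\delta_{a'b}\sigma^{bq} + \sum_{p}\sigma_{pp'}\sigma^{pq} - \delta_{p'q} \\
&= \sum_{b}\sigma_{bp'}\sigma^{bq} + \sum_{p}\sigma_{pp'}\sigma^{pq} - \delta_{p'q} \\
&= \sum_{i}\sigma_{ip'}\sigma^{iq} - \delta_{p'q} \\
&= \delta_{p'q} - \delta_{p'q} = 0
\end{aligned} \tag{20}$$

The $\delta_{ap'}$ of (19) vanishes because $a$ and $p'$ have different ranges.
Equation (20) is now multiplied by $\bar\sigma^{p'p''}$ and summed on $p'$ to show that the
$d_{pq}$ vanish.

We have shown, therefore, that the quadratic form of (6) is simply
the first sum of (14):

$$\sum_{a,b}\sigma^{ab}z_az_b = \sum_{a,b}\sigma^{ab}(y_a + c_a)(y_b + c_b) \tag{21}$$

and hence that the coefficients of the quadratic form in the conditional
density of the $y_a$ are the same as in the original density. Further, the
regression functions, $-c_a$, are linear functions of the fixed variates $y_p$.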
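
For the bivariate case ($k = 2$, $r = 1$) the result can be checked by
direct arithmetic. The sketch below assumes hypothetical values for the
variances and covariance; it computes the regression function $-c_1$
from (12) and (13) and compares it with the familiar form
$(\sigma_{12}/\sigma_{22})\,y_2$. It also compares the conditional variance
$1/\sigma^{11}$, which follows from (21), with $\sigma_{11} - \sigma_{12}^2/\sigma_{22}$.

```python
import math

# Hypothetical variances and covariance of (y_1, y_2).
s11, s12, s22 = 4.0, 1.5, 2.0

# Coefficients sigma^{ij} of the quadratic form: the inverse of ||sigma_ij||.
det = s11 * s22 - s12 ** 2
p11, p12, p22 = s22 / det, -s12 / det, s11 / det

# Equation (12) with r = 1: the inverse of the 1x1 matrix ||sigma^{11}|| is 1/sigma^{11}.
a12 = p12 / p11

y2 = 0.7                      # an arbitrary fixed value of the second variate
c1 = a12 * y2                 # equation (13)
regression = -c1              # mean of y_1 given y_2
classical = (s12 / s22) * y2  # the usual bivariate regression line

cond_var = 1.0 / p11          # conditional variance read off from (21)

print(math.isclose(regression, classical))
print(math.isclose(cond_var, s11 - s12 ** 2 / s22))
```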

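9.6. The Moment Generating Function. The joint moment
generating function for $x_1, x_2, \cdots, x_k$ is

$$m(t_1, t_2, \cdots, t_k) = E\left(e^{\sum t_ix_i}\right) \tag{1}$$

$$= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\left(\frac{1}{2\pi}\right)^{k/2}\sqrt{|\sigma^{ij}|}\; e^{-\frac{1}{2}\sum\sum\sigma^{ij}(x_i-\xi_i)(x_j-\xi_j) + \sum t_ix_i}\prod dx_i \tag{2}$$

Let $x_i - \xi_i = y_i$. To perform the integration, we again need to
complete the squares on the $y$'s. We shall merely exhibit the result and
show that it is correct. Consider the expression

$$\begin{aligned}
&\sum_{i}\sum_{j}\sigma^{ij}\left(y_i - \sum_{m}\sigma_{im}t_m\right)\left(y_j - \sum_{n}\sigma_{jn}t_n\right) \\
&\quad= \sum_{i}\sum_{j}\sigma^{ij}y_iy_j - \sum_{i}\sum_{j}\sum_{n}\sigma^{ij}y_i\sigma_{jn}t_n - \sum_{i}\sum_{j}\sum_{m}\sigma^{ij}y_j\sigma_{im}t_m + \sum_{i}\sum_{j}\sum_{m}\sum_{n}\sigma^{ij}\sigma_{im}\sigma_{jn}t_mt_n
\end{aligned} \tag{3}$$

In the second term we shall sum first on $j$ and use the relation

$$\sum_{j}\sigma^{ij}\sigma_{jn} = \delta_{in}$$

to obtain

$$\sum_{i}\sum_{j}\sum_{n}\sigma^{ij}\sigma_{jn}y_it_n = \sum_{i}\sum_{n}\delta_{in}y_it_n = \sum_{i}y_it_i$$

since the sum on $n$ of $\delta_{in}t_n = t_i$ because $\delta_{in} = 0$ except when $n = i$.
Similarly, the third term in (3) reduces to $\sum y_jt_j = \sum y_it_i$. In the fourth
term of (3) we sum first on $i$ to obtain

$$\sum_{j}\sum_{m}\sum_{n}\delta_{mj}\sigma_{jn}t_mt_n$$

and then sum on $j$ to obtain

$$\sum_{m}\sum_{n}\sigma_{mn}t_mt_n$$

We have finally

$$\sum_{i}\sum_{j}\sigma^{ij}\left(y_i - \sum_{m}\sigma_{im}t_m\right)\left(y_j - \sum_{n}\sigma_{jn}t_n\right) = \sum_{i}\sum_{j}\sigma^{ij}y_iy_j - 2\sum_{i}y_it_i + \sum_{i}\sum_{j}\sigma_{ij}t_it_j$$

and (2) may be put in the form

$$m(t_1, \cdots, t_k) = e^{\sum t_i\xi_i}\, e^{\frac{1}{2}\sum\sum\sigma_{ij}t_it_j}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\left(\frac{1}{2\pi}\right)^{k/2}\sqrt{|\sigma^{ij}|}\; e^{-\frac{1}{2}\sum\sum\sigma^{ij}\left(y_i - \sum\sigma_{im}t_m\right)\left(y_j - \sum\sigma_{jn}t_n\right)}\prod dy_i$$

The integral here is clearly one, since it is the integral of a multivariate
normal density with parameters $\xi_i = \sum_m\sigma_{im}t_m$. Hence the
moment generating function is

$$m(t_1, t_2, \cdots, t_k) = e^{\sum t_i\xi_i + \frac{1}{2}\sum\sum\sigma_{ij}t_it_j} \tag{4}$$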


On differentiating $m$ with respect to $t_r$ and then putting all $t_i = 0$, we
find

$$E(x_r) = \xi_r$$

and the second derivatives show that

$$E(x_r^2) = \sigma_{rr} + \xi_r^2$$
$$E(x_rx_s) = \sigma_{rs} + \xi_r\xi_s$$

remembering that $\sigma_{rs} = \sigma_{sr}$. The variances and covariances of the $x_i$
are therefore $\sigma_{ii}$ and $\sigma_{ij}$; hence the inverse of the matrix $\|\sigma^{ij}\|$ of the
quadratic form of the multivariate normal distribution is in fact the
matrix of variances and covariances of the distribution.

As in the case of the bivariate distribution we may define correlations
$\rho_{ij}$ between $x_i$ and $x_j$ by the relations

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$$

and these correlations may be used as parameters instead of the
covariances. It can be shown that if $|\sigma^{ij}|$ is positive, as is required by the
definition of the distribution, all the correlations must lie between $-1$
and $+1$. If all the correlations (or covariances) are zero, then the
multivariate distribution reduces to the product of $k$ univariate normal
distributions with variances $1/\sigma^{ii}$.
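
The statements about the derivatives of (4) can be illustrated by
differencing the moment generating function numerically. This is a
sketch only: the means `xi` and the covariance matrix `sigma` are
hypothetical values for $k = 2$, and central differences with a small
step `h` stand in for the exact derivatives at $t = 0$.

```python
import math

xi = [1.0, -2.0]                   # hypothetical means
sigma = [[2.0, 0.6], [0.6, 1.0]]   # hypothetical variances and covariances

def mgf(t1, t2):
    """Equation (4) for k = 2."""
    t = (t1, t2)
    linear = sum(t[i] * xi[i] for i in range(2))
    quadratic = sum(sigma[i][j] * t[i] * t[j]
                    for i in range(2) for j in range(2))
    return math.exp(linear + 0.5 * quadratic)

h = 1e-4
# First derivatives at t = 0 approximate E(x_1) and E(x_2).
mean_1 = (mgf(h, 0) - mgf(-h, 0)) / (2 * h)
mean_2 = (mgf(0, h) - mgf(0, -h)) / (2 * h)
# Second derivatives at t = 0 approximate E(x_1^2) and E(x_1 x_2).
second_11 = (mgf(h, 0) - 2 * mgf(0, 0) + mgf(-h, 0)) / h ** 2
second_12 = (mgf(h, h) - mgf(h, -h) - mgf(-h, h) + mgf(-h, -h)) / (4 * h ** 2)

var_1 = second_11 - mean_1 ** 2        # should be close to sigma[0][0]
cov_12 = second_12 - mean_1 * mean_2   # should be close to sigma[0][1]
print(mean_1, mean_2, var_1, cov_12)
```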

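9.7. Estimators. If random samples of size $n$, $(x_{1\alpha}, x_{2\alpha}, \cdots, x_{k\alpha})$,
$\alpha = 1, 2, \cdots, n$, are drawn from a $k$-variate normal population, the
joint density of the observations is

$$\left(\frac{1}{2\pi}\right)^{nk/2}|\sigma^{ij}|^{n/2}\; e^{-\frac{1}{2}\sum_{\alpha}\sum_{i}\sum_{j}\sigma^{ij}(x_{i\alpha}-\xi_i)(x_{j\alpha}-\xi_j)} \tag{1}$$

and the logarithm of the likelihood is

$$L = -\frac{nk}{2}\log 2\pi + \frac{n}{2}\log|\sigma^{ij}| - \frac{1}{2}\sum_{\alpha}\sum_{i}\sum_{j}\sigma^{ij}(x_{i\alpha}-\xi_i)(x_{j\alpha}-\xi_j) \tag{2}$$

To estimate the parameters $\xi_i$ and $\sigma^{ij}$, we solve the equations obtained
by putting the derivatives of $L$ with respect to these parameters equal
to zero. Considering first the means,

$$\frac{\partial L}{\partial\xi_1} = \frac{1}{2}\sum_{\alpha}\sum_{j}\sigma^{1j}(x_{j\alpha}-\xi_j) + \frac{1}{2}\sum_{\alpha}\sum_{i}\sigma^{i1}(x_{i\alpha}-\xi_i) = \sum_{\alpha}\sum_{i}\sigma^{i1}(x_{i\alpha}-\xi_i) \tag{3}$$

since $\sigma^{ij} = \sigma^{ji}$. And in general for $\xi_r$ we have

$$\frac{\partial L}{\partial\xi_r} = \sum_{\alpha}\sum_{i}\sigma^{ir}(x_{i\alpha}-\xi_i) \qquad r = 1, 2, \cdots, k \tag{4}$$

If we substitute $\bar{x}_i = (1/n)\sum_{\alpha}x_{i\alpha}$ in the last expression and equate it
to zero, we have a set of $k$ equations:

$$n\sum_{i=1}^{k}\sigma^{ir}(\bar{x}_i - \hat\xi_i) = 0 \qquad r = 1, 2, \cdots, k \tag{5}$$

to be solved for the $\hat\xi_i$. On dividing by $n$, then multiplying by $\sigma_{rs}$,
and summing on $r$, we have

$$\sum_{r}\sum_{i}\sigma_{rs}\sigma^{ir}(\bar{x}_i - \hat\xi_i) = 0$$

or

$$\sum_{i}\delta_{is}(\bar{x}_i - \hat\xi_i) = 0$$

or

$$\bar{x}_s - \hat\xi_s = 0 \qquad s = 1, 2, \cdots, k$$

The estimators $\hat\xi_i$ of the population means $\xi_i$ are therefore the sample
means,

$$\hat\xi_i = \bar{x}_i = \frac{1}{n}\sum_{\alpha}x_{i\alpha} \tag{6}$$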
To estimate the $\sigma^{ij}$, we must differentiate $L$ with respect to each of
these parameters. We have $\sigma^{ij} = \sigma^{ji}$; however it will be simpler to
regard $\sigma^{ij}$ as different from $\sigma^{ji}$. We seek the maximum of $L$ subject to
the restrictions on the variables, $\sigma^{ij} = \sigma^{ji}$, but we shall find first the
maximum of $L$ without observing these restrictions. Certainly the
unrestricted maximum will be at least as large as the restricted
maximum. We have

$$\frac{\partial L}{\partial\sigma^{rs}} = \frac{n}{2}\,\frac{\text{cofactor of }\sigma^{rs}}{|\sigma^{ij}|} - \frac{1}{2}\sum_{\alpha}(x_{r\alpha}-\xi_r)(x_{s\alpha}-\xi_s) = \frac{n}{2}\sigma_{rs} - \frac{1}{2}\sum_{\alpha}(x_{r\alpha}-\xi_r)(x_{s\alpha}-\xi_s) \tag{7}$$

On putting this expression equal to zero for all pairs $(r, s)$, we have a
set of $k^2$ equations to solve for the $\hat\sigma^{ij}$. The solutions will obviously
involve $\hat\xi_i$, and we have already solved for those in equation (6). Let
us now define

$$a_{ij} = \sum_{\alpha}(x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j) \tag{8}$$

Then (7), after substituting $\bar{x}_i$ for $\xi_i$, becomes

$$\frac{n}{2}\sigma_{rs} - \frac{1}{2}a_{rs}$$

On equating this to zero we have

$$\hat\sigma_{rs} = \frac{1}{n}a_{rs} \qquad r, s = 1, 2, \cdots, k \tag{9}$$

and if we let $\|a^{ij}\|$ be the inverse of $\|a_{ij}\|$, we have

$$\hat\sigma^{rs} = na^{rs} \tag{10}$$

We have located the unrestricted maximum, but it turns out to be
equivalent to the restricted maximum because it is obvious from (8)
that $a_{ij} = a_{ji}$; hence $\hat\sigma^{ij} = \hat\sigma^{ji}$. Thus the same maximum would have
been located had we used the restrictions $\sigma^{ij} = \sigma^{ji}$ originally; the only
point of omitting the restrictions is that it simplifies the differentiation
of the determinant in (2).

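The maximum likelihood estimators of the means, variances, and
covariances are therefore

$$\hat\xi_i = \frac{1}{n}\sum_{\alpha}x_{i\alpha}$$

$$\hat\sigma_{ij} = \frac{1}{n}\sum_{\alpha}(x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j) \tag{11}$$

and the estimators of the parameters $\sigma^{ij}$ are given by the inverse of $\|\hat\sigma_{ij}\|$,

$$\|\hat\sigma^{ij}\| = \|\hat\sigma_{ij}\|^{-1} \tag{12}$$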
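
As an illustration, the estimators (11) and (12) can be computed for a
small bivariate sample. The sketch below uses the six observations
quoted in Prob. 12 of Sec. 9.8; the variable names are chosen for this
example only. Note the divisor $n$, not $n - 1$, in the estimated
variances and covariances.

```python
# The six bivariate observations quoted in Prob. 12 of Sec. 9.8.
sample = [(2.5, 7.0), (4.0, 9.0), (0.4, 1.7),
          (1.2, 2.0), (0.3, 0.0), (1.5, 3.7)]
n, k = len(sample), 2

# Equation (6): the estimated means are the sample means.
means = [sum(obs[i] for obs in sample) / n for i in range(k)]

# Equation (8): sums of squares and products about the sample means.
a = [[sum((obs[i] - means[i]) * (obs[j] - means[j]) for obs in sample)
      for j in range(k)] for i in range(k)]

# Equation (11): estimated variances and covariances (divisor n).
cov_hat = [[a[i][j] / n for j in range(k)] for i in range(k)]

# Equation (12): estimated coefficients of the quadratic form (2x2 inverse).
det = cov_hat[0][0] * cov_hat[1][1] - cov_hat[0][1] * cov_hat[1][0]
prec_hat = [[cov_hat[1][1] / det, -cov_hat[0][1] / det],
            [-cov_hat[1][0] / det, cov_hat[0][0] / det]]

print(means)
print(cov_hat)
print(prec_hat)
```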


9.8. Problems 

1. Show that the contour lines for the bivariate normal density 
[i.e., curves for which f(x, y) = constant] are ellipses. 

2. Show that any plane perpendicular to the x, y plane intersects 
the normal surface in a curve of the normal form. 

3. If the exponent of the exponential in a bivariate normal density 
is —76[4(@ + 1)? — 2(@ + 1)(y — 2) + (y — 2)?], what are the means, 
variances, and covariance of the variates? 

4. What is the moment generating function for the distribution 
specified in Prob. 3? 

5. What is the moment generating function for moments about 
the means for the bivariate normal distribution? 


3 1 0 O 
6. Find the inverse of the matrix L Koi 
002 0 
000 4 




7. Find the variances and covariances of normal variates which have

the quadratic form 2a} + 2$ + 423 — aye — 2225 in their distribution. 
8. What is the marginal density of xı and zs in Prob. 7? 
9. What is the conditional density of xı and zs in Prob. 7? 

10. If the matrix of Prob. 6 is the matrix ||| of a normal distribu- 
tion of zi, £2, £3, 24, show that the conditional distribution of xı and x2 
is the same as the marginal distribution of xı and zs, hence that the 
pair (x1, 22) is distributed independently of the pair (ws, 4). 

11. Show that the determinant with $k$ rows and columns,

$$\begin{vmatrix} a & b & b & \cdots & b \\ b & a & b & \cdots & b \\ b & b & a & \cdots & b \\ \vdots & \vdots & \vdots & & \vdots \\ b & b & b & \cdots & a \end{vmatrix}$$

which has $a$'s in the main diagonal and $b$'s everywhere else, has the
value

$$(a - b)^{k-1}[a + (k - 1)b]$$

Before expanding the determinant, subtract the second row from the
first, the third from the second, and so on; then add the first column to
the second, the second to the third, and so on.

12. Given the sample (2.5, 7.0), (4.0, 9.0), (0.4, 1.7), (1.2, 2.0), 
(0.3, 0.0), (1.5, 3.7) from a normal bivariate population, find the maxi- 
mum-likelihood estimate of the regression function for the conditional 
distribution of xə Plot the sample observations and the regression 
function. 

13. Consider any multivariate density $f(x_1, x_2, \cdots, x_k)$. One can
define

The means: $\xi_i = E(x_i)$

The variances: $\sigma_{ii} = E[(x_i - \xi_i)^2]$

The covariances: $\sigma_{ij} = E[(x_i - \xi_i)(x_j - \xi_j)]$

The correlations: $\rho_{ij} = \dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$

What is the mean and variance of any linear function $y = \sum a_ix_i$ of the
$x$'s?

14. Referring to Prob. 13, what is the correlation between two linear
functions $y = \sum a_ix_i$ and $z = \sum b_ix_i$ ($y \neq kz$)?


15. What is the covariance matrix for the multinomial distribution 
[equation (3.5.2)]? 
16. Referring to Prob. 13, the conditional density of the first $r$ $x$'s is

$$\frac{f(x_1, x_2, \cdots, x_k)}{g(x_{r+1}, \cdots, x_k)}$$

where $g$ represents the marginal density of the remaining variates.
The conditional distribution has means, variances, and covariances
which may be functions of the $x_{r+1}, \cdots, x_k$ and may be denoted by
$\xi_i(x_{r+1}, \cdots, x_k)$ (the regression functions) and $\sigma_{ij}(x_{r+1}, \cdots, x_k)$
where now $i, j = 1, 2, \cdots, r$. Show that the expected value of the
regression function $\xi_i(x_{r+1}, \cdots, x_k)$ is the mean of $x_i$ under the
unconditional distribution.

17. Show that the $\sigma_{ij}(x_{r+1}, \cdots, x_k)$ of Prob. 16 are constants for
the multivariate normal distribution.

18. Verify the details of the sequence of equations (5.18 to 5.20). 

19. The expected values of the oi(z,41, * * * ,2;) defined in Prob. 16 
are called variances and covariances about the regression functions and 
are usually denoted by 


F(a, AD » etri, ur D Tr) = 


Fyre k = Eloi(trya, + * + ,22)] 


The partial correlation coefficients of the conditional distribution are 
defined by 


x Tij- (r41) ---k 
V Tiit 2 kOji (r1) -e-k 


Pij (r41) +-+ k 


Find pis; in terms of pı, ps, ps, and p, for the multinomial distribution, 
taking the number of classes to be four. 

20. What is o.z for the bivariate normal distribution? 

21. Find the conditional density of x; and zs, given zs, for the tri- 
variate normal distribution, and show that the regression functions are 
linear. (Simplify the algebra by using variates Yi =x; — &. The 
means of yı and y» are (o13/035)y3 and (23/033) ys. 

t 22. Find the variances and covariances about the regression func- 
tions for the conditional distribution of Prob. 21. 
23. Show, for the trivariate normal distribution, that

$$\rho_{12\cdot 3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}}$$

24. Let $x_1, x_2, \cdots, x_{2k}$ denote scores on $2k$ questions in an aptitude
test. Let the scores be normally distributed, each with the same mean
and variance ($\mu$ and $\sigma^2$), and such that the correlation between any
pair of questions is $\rho > 0$. If $y_1 = \sum_{i=1}^{k}x_{2i-1}$ and $y_2 = \sum_{i=1}^{k}x_{2i}$ are total
scores on the odd and even questions, find the correlation between $y_1$
and $y_2$ and show that it can be made as near unity as one pleases by
making the test sufficiently long.

25. Let $x_{1\cdot 23\cdots k}$ represent the deviation of $x_1$ from its regression
function in the conditional distribution of $x_1$, given $x_2, x_3, \cdots, x_k$.
Show for a trivariate normal distribution that $x_1$, $x_{2\cdot 1}$, $x_{3\cdot 21}$ are
independently normally distributed.

26. Generalize the result of Prob. 25 to $k$ variates.

27. Let $x_1, x_2, \cdots, x_k$ have the multivariate normal distribution
and consider the conditional distribution of $x_1$, given the other $k - 1$
variates. Let the regression function be denoted by $z$; the correlation
between $x_1$ and $z$ is called the multiple correlation coefficient of $x_1$ on
$x_2, \cdots, x_k$ and is denoted by $R_{1\cdot 23\cdots k}$. Show for a trivariate normal
distribution that

$$\sigma_{11\cdot 23} = \sigma_{11}(1 - R_{1\cdot 23}^2)$$

28. Referring to Prob. 27, show that

$$\sigma_{11\cdot 23\cdots k} = \sigma_{11}(1 - R_{1\cdot 23\cdots k}^2)$$

29. Show that

$$1 - R_{1\cdot 23}^2 = (1 - \rho_{12}^2)(1 - \rho_{13\cdot 2}^2)$$

30. Show that

$$1 - R_{1\cdot 23\cdots k}^2 = (1 - \rho_{12}^2)(1 - \rho_{13\cdot 2}^2)(1 - \rho_{14\cdot 23}^2)\cdots(1 - \rho_{1k\cdot 23\cdots(k-1)}^2)$$


CHAPTER 10 
SAMPLING DISTRIBUTIONS 


10.1. Distributions of Functions of Random Variables. In order to 
study further the problem of estimation, it is necessary to have the 
distributions of the estimators. In this section we shall consider 
methods of obtaining such distributions, and then in the remaining 
sections of the chapter the methods will be employed to obtain certain 
distributions of particular interest. 

A variate $x$ may be transformed by some function of $x$, say $u(x)$, to
define a new variate $u$. We may think of the population over which $x$
varies as being changed to a new population over which $u$ varies. A
sample value $x_0$, for example, drawn from the $x$ population may be
interpreted as determining an observation $u_0 = u(x_0)$ from the $u$
population. The density of $u$, say $g(u)$, will be determined by the
transformation $u(x)$ together with the density $f(x)$ of $x$.

If $x$ is a discrete variate, the distribution of a function $u(x)$ is
determined directly by the laws of probability. If $x$ takes on the values
$0, 1, 2, \cdots, r$, for example, with probabilities $f(0), f(1), \cdots, f(r)$,
then the possible values of $u$, say $u_0, u_1, \cdots, u_s$, are determined by
substituting the successive values of $x$ in $u(x)$, which we shall assume
to be a single-valued function of $x$. It may be that several values of $x$
give rise to the same value of $u$. The probability that $u$ takes on a
given value, say $u_i$, is

$$g(u_i) = \sum{}' f(x) \tag{1}$$

where the sum, $\sum'$, is taken over all values of $x$ such that $u(x) = u_i$.
Thus suppose $x$ takes on the values 0, 1, 2, 3, 4, 5 with probabilities
$p_0, p_1, p_2, p_3, p_4, p_5$; the density of $u = (x - 2)^2$ is

$$g(0) = p_2$$
$$g(1) = p_1 + p_3$$
$$g(4) = p_0 + p_4$$
$$g(9) = p_5$$

and 0, 1, 4, 9 are all the possible values of $u$. Similarly if $u$ is a
function of several discrete variates $x_1, x_2, \cdots, x_k$ with a joint density
$f(x_1, x_2, \cdots, x_k)$, the probability that $u(x_1, x_2, \cdots, x_k)$ takes on a
particular one of its values $u_i$ is

$$g(u_i) = \sum{}' f(x_1, x_2, \cdots, x_k) \tag{2}$$

where $\sum'$ is taken over all sets of values of the $x$'s such that
$u(x_1, x_2, \cdots, x_k) = u_i$.
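
Rule (1) amounts to grouping the probabilities of $x$ by the value of
$u(x)$. A minimal sketch follows; the function name
`transformed_density` and the particular probabilities $p_0, \cdots, p_5$
are assumptions made for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

def transformed_density(f, u):
    """Equation (1): add up f(x) over all x that give the same value u(x)."""
    g = defaultdict(Fraction)       # missing keys start at Fraction(0)
    for x, p in f.items():
        g[u(x)] += p
    return dict(g)

# Hypothetical probabilities p_0, ..., p_5 proportional to 1, 2, ..., 6.
f = {x: Fraction(x + 1, 21) for x in range(6)}

g = transformed_density(f, lambda x: (x - 2) ** 2)
for value in sorted(g):
    print(value, g[value])
```

The keys of `g` are the possible values 0, 1, 4, 9 of $u = (x - 2)^2$,
and each entry is the corresponding sum of the $p$'s given in the text.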
Fig. 43.

The basic and often the simplest method for finding distributions of
functions of continuous random variables was given in Prob. 28 of
Chap. 4. If $x$ has density $f(x)$ and $u(x)$ is a function of $x$, then the
cumulative distribution of $u$ is readily found. Let $G(u)$ denote the
cumulative distribution; then

$$G(u) = P[u(x) < u] \tag{3}$$

$$= \int_{u(x) < u} f(x)\,dx \tag{4}$$

in which the integral is taken over that part of the $x$ axis where the
function $u(x)$ is less than $u$. If, for example,


u(x) = z? — 2 (5) 
then 
Gu) = [177 gos = Fu F9) (6) 


Of course the density function may be obtained by differentiating the 
cumulative distribution. 

It will be instructive to consider another approach to this problem 
of finding the distributions of functions of continuous variates. 

We shall first investigate functions of a single random variate $x$.
To see how $f(x)$ and $u(x)$ determine $g(u)$, we may consider the situation
illustrated in Fig. 43, where a particular function $u(x)$ is plotted. We
wish to determine $g(u)$ at the point marked $u$ on the $u$ axis between the
horizontal dotted lines. If we solve the equation $u = u(x)$ for $x$, we
may obtain one or more values of $x$; thus in the figure there are three
values, $x_1, x_2, x_3$, which correspond to the given value of $u$. A small
interval $\Delta u$ about $u$ determines corresponding intervals $\Delta x_1$, $\Delta x_2$, and
$\Delta x_3$ about the points $x_i$ which correspond to $u$. The function $g(u)$
must be such a function that

$$P(u \text{ lies in } \Delta u) = \int_{\Delta u} g(u)\,du \tag{7}$$

where the symbolism on the right means that the integral is to be
taken over the interval $\Delta u$. We have already seen (Sec. 4.2) that a
value $u'$ may be found in the interval such that

$$\int_{\Delta u} g(u)\,du = g(u')\,\Delta u \tag{8}$$


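Now $u$ will lie in the interval $\Delta u$ provided $x$ lies in any one of the
intervals $\Delta x_1$, $\Delta x_2$, $\Delta x_3$; hence we may state

$$P(u \text{ in } \Delta u) = P(x \text{ in } \Delta x_1) + P(x \text{ in } \Delta x_2) + P(x \text{ in } \Delta x_3) \tag{9}$$

and since

$$P(x \text{ in } \Delta x_i) = \int_{\Delta x_i} f(x)\,dx = f(x_i')\,\Delta x_i \tag{10}$$

for a properly chosen value $x_i'$ in $\Delta x_i$, we have

$$g(u')\,\Delta u = f(x_1')\,\Delta x_1 + f(x_2')\,\Delta x_2 + f(x_3')\,\Delta x_3 \tag{11}$$

From this relation it is clear that $g(u)$ may be determined by dividing
through by $\Delta u$ and taking the limit as $\Delta u \to 0$.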

The curve $u = u(x)$ may also be represented over $\Delta x_1$ by the
equation $x = x_1(u)$ obtained by solving $u = u(x)$ for $x$. Similarly over
$\Delta x_2$ the curve may be represented by $x_2(u)$, and over $\Delta x_3$ by $x_3(u)$.
From (11) we have

$$\lim_{\Delta u \to 0} g(u') = \lim_{\Delta u \to 0}\left[f(x_1')\frac{\Delta x_1}{\Delta u} + f(x_2')\frac{\Delta x_2}{\Delta u} + f(x_3')\frac{\Delta x_3}{\Delta u}\right] \tag{12}$$

and when $\Delta u \to 0$ in such a way as to collapse on $u$, all the $\Delta x_i$ also
approach zero so that they collapse on the corresponding $x_i$. The
values $u'$ and $x_i'$ necessarily approach $u$ and $x_i$ since the primed values
must lie within the corresponding intervals. The ratios $\Delta x_i/\Delta u$, of
course, approach the derivatives of the $x_i$ when $\Delta u$ approaches zero.
It follows then that

$$g(u) = f(x_1)\frac{dx_1}{du} + f(x_2)\frac{dx_2}{du} + f(x_3)\frac{dx_3}{du}$$

except that one revision is required. Some of the derivatives may be
negative; thus at $x_2$ in the figure $u$ decreases with increasing $x$, hence
$dx_2/du$ is negative. We are, however, interested in the positive areas
in (9), and for this reason we must change the signs of any negative
derivatives. We shall use a subscript $+$ to indicate that a quantity
is to have its sign changed if it is negative. We shall write, therefore,

$$g(u) = f(x_1)\left(\frac{dx_1}{du}\right)_+ + f(x_2)\left(\frac{dx_2}{du}\right)_+ + f(x_3)\left(\frac{dx_3}{du}\right)_+ \tag{13}$$

and since we shall want $g(u)$ to be a function of $u$ instead of the $x_i$, we
shall substitute the functions $x_i(u)$ for the $x_i$ in this relation.

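Fig. 44.

To illustrate the above ideas, we may consider the variate $x$ with
density

$$f(x) = \tfrac{2}{9}(x + 1) \qquad -1 < x < 2 \tag{14}$$

and transform $x$ to $u$ by the relation $u = x^2$. The function $u$ is plotted
in Fig. 44. The range of $u$ is clearly $0 < u < 4$. If $u < 1$, there are
two values of $x$ which correspond to each value of $u$; we may designate
them by

$$x_1(u) = -\sqrt{u} \qquad x < 0$$
$$x_2(u) = \sqrt{u} \qquad x > 0 \tag{15}$$

For $u > 1$, there is only one corresponding value of $x$, namely,

$$x(u) = \sqrt{u}$$

We must therefore define the distribution of $u$ in two parts. If
$0 < u < 1$, we have by (13):

$$\begin{aligned}
g(u) &= \tfrac{2}{9}\big(x_1(u) + 1\big)\left(\frac{dx_1}{du}\right)_+ + \tfrac{2}{9}\big(x_2(u) + 1\big)\left(\frac{dx_2}{du}\right)_+ \\
&= \tfrac{2}{9}\left(-\sqrt{u} + 1\right)\frac{1}{2\sqrt{u}} + \tfrac{2}{9}\left(\sqrt{u} + 1\right)\frac{1}{2\sqrt{u}} \\
&= \frac{2}{9\sqrt{u}}
\end{aligned} \tag{16}$$

while if $1 < u < 4$,

$$g(u) = \tfrac{2}{9}\big(x(u) + 1\big)\left(\frac{dx}{du}\right)_+ = \tfrac{2}{9}\left(\sqrt{u} + 1\right)\frac{1}{2\sqrt{u}} = \frac{\sqrt{u} + 1}{9\sqrt{u}} \tag{17}$$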
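
The two-part density can be checked against the cumulative-distribution
method of (3) and (4). The sketch below assumes the density
$f(x) = \tfrac{2}{9}(x + 1)$ on $-1 < x < 2$ with $u = x^2$; the function
names `F`, `G`, and `g` are labels chosen for this illustration. It
differentiates $G(u) = P(x^2 < u)$ numerically and compares the result
with (16) and (17).

```python
import math

def F(x):
    """Cumulative distribution of x for the density (2/9)(x + 1), -1 < x < 2."""
    if x <= -1:
        return 0.0
    if x >= 2:
        return 1.0
    return (x + 1) ** 2 / 9

def G(u):
    """Cumulative distribution of u = x^2, i.e. P(-sqrt(u) < x < sqrt(u))."""
    if u <= 0:
        return 0.0
    root = math.sqrt(u)
    return F(root) - F(-root)

def g(u):
    """Density of u in two parts, equations (16) and (17)."""
    if 0 < u < 1:
        return 2 / (9 * math.sqrt(u))
    if 1 < u < 4:
        return (math.sqrt(u) + 1) / (9 * math.sqrt(u))
    return 0.0

h = 1e-6
for u in (0.25, 0.5, 0.9, 1.5, 2.0, 3.5):
    numeric = (G(u + h) - G(u - h)) / (2 * h)   # central-difference derivative
    print(u, numeric, g(u))
```

The probability assigned to $0 < u < 1$ is $G(1) = 4/9$, so the
remaining $5/9$ falls in $1 < u < 4$.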


The general procedure is now clear. To find the distribution of any
function $u(x)$ of a random variable $x$, we find, for every $u$, all the points
$x_i$ such that $u(x_i) = u$, and express the $x_i$ as functions of $u$, say $x_i(u)$.
The density of $u$ is

$$g(u) = \sum_{i} f\big(x_i(u)\big)\left(\frac{dx_i}{du}\right)_+ \tag{18}$$

where $f(x)$ is the density of $x$. Often we shall deal with monotone
functions $u(x)$, functions which are single-valued and such that $x(u)$
is also single-valued. In this case the sum in (18) would consist of
only one term, and we have

$$g(u) = f\big(x(u)\big)\left(\frac{dx}{du}\right)_+ \tag{19}$$

for monotone functions $u(x)$.
When $u$ is a function of several random variables, the distribution
of $u$ may be obtained as a marginal distribution. Suppose $x_1, x_2,
\cdots, x_k$ have a density $f(x_1, x_2, \cdots, x_k)$ and the density of $u(x_1, x_2,
\cdots, x_k)$ is required. We may eliminate one of the $x$'s, say $x_1$, in
terms of $u$ by solving the equation

$$u(x_1, x_2, \cdots, x_k) = u$$

for $x_1$ to obtain a function $x_1(u, x_2, x_3, \cdots, x_k)$, or several such
functions $x_{1i}(u, x_2, \cdots, x_k)$ if $u$ is not a monotone function of $x_1$. Using
a similar argument to that used to obtain (18), we may obtain a
density

$$g(u, x_2, \cdots, x_k) = \sum_{i} f\big(x_{1i}(u, x_2, \cdots, x_k), x_2, \cdots, x_k\big)\left(\frac{\partial x_{1i}}{\partial u}\right)_+ \tag{20}$$

and the density of $u$ may then be found by integrating out $x_2, x_3, \cdots,
x_k$ in $g$. We shall illustrate this method in the next section where
we shall find the distribution of

$$u(x_1, x_2, \cdots, x_n) = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}$$

for samples from a normal population.
The procedure described above may be generalized to determine
the joint distribution of several functions $u_1(x_1, \cdots, x_k)$, $u_2(x_1, \cdots,
x_k)$, $\cdots$, $u_r(x_1, \cdots, x_k)$ ($r \leq k$) of $k$ random variables. We may
put

$$\begin{aligned}
u_1(x_1, \cdots, x_k) &= u_1 \\
u_2(x_1, \cdots, x_k) &= u_2 \\
&\;\;\vdots \\
u_r(x_1, \cdots, x_k) &= u_r
\end{aligned} \tag{21}$$

and solve the resulting set of equations for $x_1, x_2, \cdots, x_r$ to obtain
a set of $r$ functions $x_i(u_1, u_2, \cdots, u_r, x_{r+1}, \cdots, x_k)$, or if the
solution is not unique, we may have several such sets of $r$ functions. The
joint density of the $u$'s and the remaining $x$'s can be shown to be

$$g(u_1, \cdots, u_r, x_{r+1}, \cdots, x_k) = \sum f(x_1, \cdots, x_r, x_{r+1}, \cdots, x_k)\left|\frac{\partial x_i}{\partial u_j}\right|_+ \tag{22}$$

where the sum is taken over all sets of solutions of (21) and where
$|\partial x_i/\partial u_j|_+$ is the positive value of the determinant of the partial
derivatives of $x_i(u_1, \cdots, u_r, x_{r+1}, \cdots, x_k)$ with respect to the $u_j$ ($i, j = 1,
2, \cdots, r$). We omit the proof of (22); it is essentially the same
as the derivation of the formulas for transforming variables in multiple
integrals, which may be found in any textbook on advanced calculus.
Use of Moment Generating Functions. There is a second method of
determining distributions of functions of random variables which we
shall find to be particularly useful. If $u(x_1, x_2, \cdots, x_k)$ is a function
of random variables $x_i$ which are distributed by $f(x_1, x_2, \cdots, x_k)$, we
may find the moment generating function of $u$

$$m(t) = E(e^{tu}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{tu(x_1, x_2, \cdots, x_k)} f(x_1, x_2, \cdots, x_k)\prod dx_i \tag{23}$$

If the resulting function of $t$ can be recognized as the moment
generating function of some known distribution, then it will follow that $u$ has
that distribution by virtue of the theorem given at the end of Chap. 5.
This method is quite powerful in connection with certain
techniques of advanced mathematics (the theory of Laplace transforms
and Fourier transforms) which enable one to determine the
distribution associated with any given moment generating function. The
method can also be generalized to determine the joint distribution of
several functions of random variables.

10.2. Distribution of the Sample Mean for Normal Populations.
If samples $(x_1, x_2, \cdots, x_n)$ of size $n$ are drawn from a normal
population, the joint density function for the observations is

$$f(x_1, x_2, \cdots, x_n) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}[(x_i-\mu)/\sigma]^2} = \left(\frac{1}{2\pi}\right)^{n/2}\frac{1}{\sigma^n}\, e^{-\frac{1}{2}\sum[(x_i-\mu)/\sigma]^2} \tag{1}$$

and if the variates are transformed to

$$y_i = \frac{x_i - \mu}{\sigma}$$

the density becomes

$$h(y_1, y_2, \cdots, y_n) = \left(\frac{1}{2\pi}\right)^{n/2} e^{-\frac{1}{2}\sum y_i^2} \tag{2}$$

in accordance with equation (1.22) with $r = k = n$, since $|\partial x_i/\partial y_j|$ is a
diagonal determinant with elements $\sigma$ in the main diagonal positions
and zeros elsewhere. The value of the determinant is readily seen to
be $\sigma^n$.

To find the distribution of $\bar{y}$, we eliminate $y_1$ from (2) by the
substitution

$$y_1 = n\bar{y} - \sum_{i=2}^{n}y_i \tag{3}$$

and obtain the density

$$g(\bar{y}, y_2, y_3, \cdots, y_n) = \left(\frac{1}{2\pi}\right)^{n/2} n\, e^{-\frac{1}{2}\left[\left(n\bar{y} - \sum_{2}^{n}y_i\right)^2 + \sum_{2}^{n}y_i^2\right]} \tag{4}$$

in accordance with (1.14) since $\partial y_1/\partial\bar{y} = n$. We now wish to find the
marginal distribution of $\bar{y}$. The density in (4) may be regarded as a
multivariate normal distribution of $\bar{y}, y_2, \cdots, y_n$, and examination
of the exponent shows that

$$\|\sigma^{ij}\| = \begin{Vmatrix} n^2 & -n & -n & -n & \cdots & -n \\ -n & 2 & 1 & 1 & \cdots & 1 \\ -n & 1 & 2 & 1 & \cdots & 1 \\ -n & 1 & 1 & 2 & \cdots & 1 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ -n & 1 & 1 & 1 & \cdots & 2 \end{Vmatrix} \tag{5}$$

The determinant $|\sigma^{ij}|$ must necessarily be $n^2$, since in (4) it is seen that
$\sqrt{|\sigma^{ij}|} = n$. We have seen in Sec. 9.5 that the marginal distribution
of one of a set of normally distributed variates is a normal distribution
with the same variance that the variate has in the joint distribution.
We need therefore to find $\sigma_{11}$, which is obtained by dividing the cofactor
of $\sigma^{11}$ by $|\sigma^{ij}|$. The elements of the cofactor are obtained by striking
out the first row and column of (5), and the determinant of the
resulting array is easily found to be $n$. Hence

$$\sigma_{11} = \frac{n}{n^2} = \frac{1}{n}$$

The density of $\bar{y}$ is therefore

$$\frac{\sqrt{n}}{\sqrt{2\pi}}\, e^{-\frac{1}{2}n\bar{y}^2} \tag{6}$$

Since

$$\bar{y} = \frac{\bar{x} - \mu}{\sigma} \tag{7}$$

we may transform (6) by (7) to obtain the density of $\bar{x}$,

$$\frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{n}{2}[(\bar{x}-\mu)/\sigma]^2} \tag{8}$$

by equation (1.13) since $d\bar{y}/d\bar{x} = 1/\sigma$.

The distribution (8) is the distribution approached by the
distribution of $\bar{x}$ for any population with finite variance as $n$ becomes large, as
we have seen in Sec. 7.6. We have shown here that the distribution
is exactly the distribution of the sample mean for normal populations
whether or not the sample size is large.
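
The determinant and cofactor quoted for the matrix (5) are easy to
confirm for a particular sample size. The sketch below assumes $n = 5$;
the helpers `det` and `quadratic_form_matrix` are written for this
illustration only, and exact fractions are used throughout.

```python
from fractions import Fraction

def det(m):
    """Determinant by Gaussian elimination with exact fractions."""
    a = [[Fraction(x) for x in row] for row in m]
    size = len(a)
    d = Fraction(1)
    for c in range(size):
        pivot = next((r for r in range(c, size) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d                      # a row swap changes the sign
        d *= a[c][c]
        for r in range(c + 1, size):
            factor = a[r][c] / a[c][c]
            a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    return d

def quadratic_form_matrix(n):
    """The n-by-n matrix (5) for the variates (ybar, y_2, ..., y_n)."""
    m = [[1] * n for _ in range(n)]
    for i in range(1, n):
        m[i][i] = 2
        m[0][i] = m[i][0] = -n
    m[0][0] = n * n
    return m

n = 5
m = quadratic_form_matrix(n)
cofactor = det([row[1:] for row in m[1:]])   # strike out first row and column
sigma_11 = cofactor / det(m)                 # variance of ybar

print(det(m), cofactor, sigma_11)
```

For this choice of $n$ the three printed values should be $n^2$, $n$,
and $1/n$, in agreement with the text.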
10.3. The Chi-square Distribution. We shall obtain the distribution of

$$u = \sum_{i=1}^{k} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 \tag{1}$$

where the $x_i$ are normally and independently distributed with means
$\mu_i$ and variances $\sigma_i^2$. In the joint distribution of the $x_i$ we again
transform the variates to

$$y_i = \frac{x_i - \mu_i}{\sigma_i}$$

in order to simplify the equations; u is then simply $\sum y_i^2$. The method
of moment generating functions will be employed to obtain its
distribution.

The moment generating function of u is

$$m(t) = \left(\frac{1}{2\pi}\right)^{k/2} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{t\sum y_i^2}\, e^{-\frac12 \sum y_i^2} \prod_{i=1}^{k} dy_i \tag{2}$$

and the multiple integral may be written as the product of k integrals
of the form

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac12 (1 - 2t) y_i^2}\, dy_i \tag{3}$$

The integral (3) has the value $1/\sqrt{1 - 2t}$ since multiplication of the
integral by $\sqrt{1 - 2t}$ makes it represent the area under a normal curve
with variance $1/(1 - 2t)$. It follows that

$$m(t) = \left(\frac{1}{1 - 2t}\right)^{k/2} \qquad t < \tfrac12 \tag{4}$$

The moment generating function is of the form of the moment generating
function for a gamma distribution (Sec. 6.3) with $\alpha = (k/2) - 1$
and $\beta = 2$. We may conclude therefore that the density of u is

$$f(u) = \frac{1}{[(k-2)/2]!\; 2^{k/2}}\, u^{(k-2)/2}\, e^{-u/2} \qquad u > 0 \tag{5}$$

This particular form of the gamma distribution is usually referred to
as the chi-square distribution with k degrees of freedom. The variate u is
commonly designated by the square of the Greek letter chi,

$$\chi^2 = \sum_{i=1}^{k} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 \tag{6}$$

hence the name for this distribution. The phrase degrees of freedom
refers to the number of independent squares in the sum in (6); we may
think of it, however, as merely a name for the parameter k in the
density (5).
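As a numerical check on the chi-square density and its moment generating function, the sketch below integrates the density by the midpoint rule and compares $E(e^{tu})$ with the closed form $(1 - 2t)^{-k/2}$. It is only an illustration under stated assumptions: `math.gamma(k / 2)` is used in place of the factorial $[(k-2)/2]!$, the upper limit 400 stands in for infinity, and the function names are ours.

```python
import math


def chi_square_density(u, k):
    """Chi-square density: a gamma density with alpha = k/2 - 1 and beta = 2."""
    if u <= 0:
        return 0.0
    return u ** (k / 2 - 1) * math.exp(-u / 2) / (math.gamma(k / 2) * 2 ** (k / 2))


def integrate(func, lower, upper, steps=100_000):
    """Midpoint rule on [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


def check_moment_generating_function(k, t, upper=400.0):
    """Return (numerical E[e^{tu}], closed form (1 - 2t)^(-k/2)) for t < 1/2."""
    numeric = integrate(
        lambda u: math.exp(t * u) * chi_square_density(u, k), 0.0, upper
    )
    return numeric, (1 - 2 * t) ** (-k / 2)
```

With k = 4 and t = 0.2 the two returned values should agree closely; the same `integrate` helper applied to the density alone should return a value near one.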

We may notice here that (5) gives essentially the distribution of the
maximum-likelihood estimator for $\sigma^2$ in normal populations when
$\mu$ is known. If one considers samples of size n from a normal population
with known mean $\mu$, the maximum-likelihood estimator for $\sigma^2$ is
found to be

$$\hat\sigma^2 = \frac{1}{n} \sum (x_i - \mu)^2 = \frac{\sigma^2 u}{n}$$

where $u = \sum [(x_i - \mu)/\sigma]^2$ has the chi-square distribution with n
degrees of freedom. The density for the estimator is therefore

$$f(\hat\sigma^2) = \frac{1}{[(n-2)/2]!} \left(\frac{n}{2\sigma^2}\right)^{n/2} (\hat\sigma^2)^{(n/2)-1}\, e^{-n\hat\sigma^2/2\sigma^2}$$

since $du/d\hat\sigma^2 = n/\sigma^2$. This is a gamma density with $\alpha = (n/2) - 1$ and $\beta = 2\sigma^2/n$.

The chi-square distribution is partially tabulated in Table III; the 
most complete tabulation is Karl Pearson's “Tables of the Incom- 
plete Gamma Function" (Cambridge University Press, London, 1922). 

10.4. Independence of the Sample Mean and Variance for Normal
Populations. Ordinarily the mean of a population is unknown, and
we are rather more interested in the estimator $(1/n)\sum (x_i - \bar x)^2$ for $\sigma^2$
than in the estimator $(1/n)\sum (x_i - \mu)^2$ considered in the preceding
section. We shall now derive the distribution of this estimator and
show incidentally that it is distributed independently of the sample
mean.

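We shall let

$$y_i = \frac{x_i - \mu}{\sigma} \tag{1}$$

$$u = n\bar y^2 = \frac{1}{n}\left(\sum y_i\right)^2 \tag{2}$$

$$v = \sum_{1}^{n} (y_i - \bar y)^2 \tag{3}$$

and find the joint moment generating function for u and v, say,

$$m(t_1, t_2) = E(e^{t_1 u + t_2 v}) \tag{4}$$

$$= \int \cdots \int \left(\frac{1}{2\pi}\right)^{n/2} e^{t_1 n \bar y^2 + t_2 \sum (y_i - \bar y)^2 - \frac12 \sum y_i^2} \prod_{1}^{n} dy_i$$

$$= \int \cdots \int \left(\frac{1}{2\pi}\right)^{n/2} e^{-\frac12 \left[\sum y_i^2 - 2 t_1 n \bar y^2 - 2 t_2 \sum (y_i - \bar y)^2\right]} \prod_{1}^{n} dy_i \tag{5}$$

The quadratic form may be written

$$\sum y_i^2 - 2 t_1 n \bar y^2 - 2 t_2 \sum (y_i - \bar y)^2 = \sum y_i^2 - \frac{2 t_1}{n}\left(\sum y_i\right)^2 - 2 t_2 \left[\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right] \tag{6}$$

$$= (1 - 2 t_2) \sum y_i^2 - \frac{2 (t_1 - t_2)}{n} \left(\sum y_i\right)^2$$

$$= \sum\sum \sigma^{ij} y_i y_j \tag{7}$$

where

$$\sigma^{ii} = 1 - 2 t_2 - \frac{2 (t_1 - t_2)}{n} \qquad \sigma^{ij} = -\frac{2 (t_1 - t_2)}{n} \quad (i \neq j)$$

A determinant of order n with a's in the main diagonal and b's elsewhere
has the value

$$(a - b)^{n-1} [a + (n - 1) b]$$

Hence

$$|\sigma^{ij}| = (1 - 2 t_2)^{n-1} \left[1 - 2 t_2 - \frac{2 (t_1 - t_2)}{n} - \frac{2 (n - 1)(t_1 - t_2)}{n}\right]$$

$$= (1 - 2 t_2)^{n-1} (1 - 2 t_1) \tag{8}$$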
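The determinant formula used here is easy to verify for small n. The sketch below is an illustration only: it builds the array with a's on the diagonal and b's elsewhere for chosen rational values of $t_1$ and $t_2$, evaluates the determinant by cofactor expansion, and compares it with $(1 - 2t_2)^{n-1}(1 - 2t_1)$. The helper names are hypothetical, and cofactor expansion is practical only for small n.

```python
from fractions import Fraction


def det(matrix):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if not matrix:
        return Fraction(1)
    return sum(
        (-1) ** j * matrix[0][j] * det([row[:j] + row[j + 1:] for row in matrix[1:]])
        for j in range(len(matrix))
    )


def quadratic_form_matrix(n, t1, t2):
    """Coefficients of the quadratic form: a on the diagonal, b elsewhere."""
    b = -2 * (t1 - t2) / n
    a = 1 - 2 * t2 + b
    return [[a if i == j else b for j in range(n)] for i in range(n)]


def closed_form(n, t1, t2):
    """(a - b)^(n-1) [a + (n-1) b] written in terms of t1 and t2."""
    return (1 - 2 * t2) ** (n - 1) * (1 - 2 * t1)
```

For example, with n = 4, `t1 = Fraction(1, 5)` and `t2 = Fraction(1, 7)`, `det(quadratic_form_matrix(4, t1, t2))` and `closed_form(4, t1, t2)` should be equal as exact fractions.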


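From the multivariate normal distribution it follows that

$$\int \cdots \int \left(\frac{1}{2\pi}\right)^{n/2} \sqrt{|\sigma^{ij}|}\; e^{-\frac12 \sum\sum \sigma^{ij} y_i y_j} \prod_{1}^{n} dy_i = 1 \tag{9}$$

hence the integral in (5) has the value

$$m(t_1, t_2) = \left(\frac{1}{1 - 2 t_1}\right)^{1/2} \left(\frac{1}{1 - 2 t_2}\right)^{(n-1)/2} \tag{10}$$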


The fact that the joint-moment generating function factors into a
function of $t_1$ alone and a function of $t_2$ alone implies that u and v are
independently distributed. We shall not prove this rigorously but
merely indicate the argument. Similar reasoning to that employed in
Sec. 5.4 will show that if two distributions of several variates have the
same joint-moment generating function, then the two distributions
are the same. We have a density, say f(u, v), with joint-moment
generating function (10). Given the marginal distributions $f_1(u)$ and
$f_2(v)$, we may form the bivariate function

$$g(u, v) = f_1(u) f_2(v) \tag{11}$$

which is clearly a density function. Furthermore its moment generating
function must be

$$m(t_1, 0)\, m(0, t_2) \tag{12}$$

where

$$m(t_1, t_2) = \int\!\!\int e^{t_1 u + t_2 v} f(u, v)\, du\, dv \tag{13}$$

Since (12) and (13) are identical by (10), it follows that g(u, v) and
f(u, v) are the same density and hence that f(u, v) is equal to the product
of its marginal densities.

The two factors of equation (10) are each of the form of the moment
generating function for a chi-square distribution; hence it follows that
u and v are each independently distributed by chi-square distributions,
the first having one degree of freedom, and the second n - 1 degrees
of freedom. The fact that $u = n\bar y^2$ is distributed as chi square with
one degree of freedom is in accord with the results of Secs. 2 and 3.
For we have seen that $\bar y$ is normally distributed with zero mean and
variance 1/n, and from the result of Sec. 3 with k = 1 it follows that

$$u = n\bar y^2 = \left(\frac{\bar y}{1/\sqrt{n}}\right)^2 \tag{14}$$

must have the chi-square distribution with one degree of freedom.

The function

$$v = \sum_{1}^{n} (y_i - \bar y)^2 = \sum_{1}^{n} \left(\frac{x_i - \bar x}{\sigma}\right)^2 \tag{15}$$

has the distribution given by equation (3.5) with k replaced by n - 1
instead of n, as would be the case if the deviations were measured from
the population mean. It is sometimes said that one degree of freedom
is lost by taking the sum of squares of deviations from the sample
mean rather than the population mean, or that one degree of freedom
is used up in estimating the mean. While v in equation (15) is the sum
of n squares, the squares are not all functionally independent. The
relation $\sum y_i = n\bar y$ enables one to compute any one of the deviations
$y_i - \bar y$, given the other n - 1 of them.
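In terms of v of (15), the estimator

$$\hat\sigma^2 = \frac{1}{n} \sum (x_i - \bar x)^2 \tag{16}$$

has the value

$$\hat\sigma^2 = \frac{\sigma^2 v}{n}$$

The density for this estimator is therefore:

$$f(\hat\sigma^2) = \frac{1}{[(n-3)/2]!} \left(\frac{n}{2\sigma^2}\right)^{(n-1)/2} (\hat\sigma^2)^{(n-3)/2}\, e^{-n\hat\sigma^2/2\sigma^2} \tag{17}$$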


All the results of this section apply only to normal populations. It 
can be proved that for no other distributions are (1) the sample mean 
and sample variance independently distributed, or (2) the sample mean 
exactly normally distributed, or (3) the sum of squares of deviations, 
from either the population or sample mean, exactly distributed by the 
chi-square law. 

10.5. The F Distribution. A distribution which we shall later find 
to be of considerable practical interest is that of the ratio of two quan- 
tities independently distributed by chi-square laws. Suppose u and v 
are independently distributed by chi-square distributions with m and n 
degrees of freedom, respectively. Their joint density is, by (3.5), 


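$$f(u, v) = \frac{1}{[(m-2)/2]!\,[(n-2)/2]!\; 2^{(m+n)/2}}\, u^{(m-2)/2}\, v^{(n-2)/2}\, e^{-\frac12 (u + v)} \tag{1}$$

We shall find the distribution of the quantity

$$F = \frac{u/m}{v/n} = \frac{nu}{mv} \tag{2}$$

which is sometimes referred to as the variance ratio. We shall find the
density of F by eliminating u in terms of F in (1) and then integrating
out v from the resulting density. Since

$$\frac{\partial u}{\partial F} = \frac{mv}{n} \tag{3}$$

and since F is a monotonic function of u, the joint density of F and v is,
say,

$$g(F, v) = \frac{1}{[(m-2)/2]!\,[(n-2)/2]!\; 2^{(m+n)/2}} \left(\frac{mvF}{n}\right)^{(m-2)/2} v^{(n-2)/2}\, e^{-\frac12 [(mvF/n) + v]}\, \frac{mv}{n} \tag{4}$$

To integrate out v, we must evaluate the integral

$$\int_0^\infty v^{(m+n-2)/2}\, e^{-(v/2)[1 + (mF/n)]}\, dv \tag{5}$$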
of the factors in (4) which involve v. We observe that the integral
is, apart from certain constants, the integral of a gamma density over
its whole range. In fact, if the integral were multiplied by

$$\frac{\{\tfrac12 [1 + (mF/n)]\}^{(m+n)/2}}{[(m + n - 2)/2]!} \tag{6}$$

it would be exactly the area under the gamma density with
$\alpha = (m + n - 2)/2$ and $\beta = 1/\{\tfrac12 [1 + (mF/n)]\}$, and would have the
value one. Hence the value of (5) is the reciprocal of the expression (6).
The density of F is therefore

$$h(F) = \int_0^\infty g(F, v)\, dv = \frac{(m/n)^{m/2}\, F^{(m-2)/2}}{[(m-2)/2]!\,[(n-2)/2]!\; 2^{(m+n)/2}} \int_0^\infty v^{(m+n-2)/2}\, e^{-(v/2)[1 + (mF/n)]}\, dv$$

$$= \frac{[(m + n - 2)/2]!}{[(m-2)/2]!\,[(n-2)/2]!} \left(\frac{m}{n}\right)^{m/2} \frac{F^{(m-2)/2}}{[1 + (mF/n)]^{(m+n)/2}} \qquad F > 0 \tag{7}$$


a function with two parameters m and n. These parameters are also 
called degrees of freedom; thus (7) is called the F density with m and n 
degrees of freedom; the number of degrees of freedom of the variate u 
in the numerator of F is always quoted first. 
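A quick numerical check of the F density can be made along the following lines. This is a sketch under stated assumptions rather than a tabulation method: `math.gamma` replaces the factorials, probabilities are obtained by the midpoint rule after the change of variable $x = (mF/n)/[1 + (mF/n)]$, which maps the range of F onto the unit interval, and the names `f_density` and `f_probability` are ours.

```python
import math


def f_density(F, m, n):
    """F density with m and n degrees of freedom (gamma functions for factorials)."""
    if F <= 0:
        return 0.0
    const = math.gamma((m + n) / 2) / (math.gamma(m / 2) * math.gamma(n / 2))
    return (const * (m / n) ** (m / 2) * F ** ((m - 2) / 2)
            / (1 + m * F / n) ** ((m + n) / 2))


def f_probability(a, b, m, n, steps=20_000):
    """P(a < F < b) by the midpoint rule in x = (mF/n) / (1 + mF/n)."""
    def to_x(F):
        return 1.0 if math.isinf(F) else (m * F / n) / (1 + m * F / n)

    lo, hi = to_x(a), to_x(b)
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        F = n * x / (m * (1 - x))            # inverse of the substitution
        dF_dx = n / (m * (1 - x) ** 2)       # Jacobian of the substitution
        total += f_density(F, m, n) * dF_dx * width
    return total
```

With m = 4 and n = 6, `f_probability(0.0, math.inf, 4, 6)` should be very close to one, as it must be for a density.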

Five points on the upper tail of the cumulative distribution of F are
given in Table V. More complete tables may be found in the refer- 
ence cited in the footnote to Table V and in Fisher and Yates, “Sta- 
tistical Tables” (Oliver & Boyd, Ltd., Edinburgh and London, 1938). 
The reciprocals of the numbers in Table V provide five points on the 
lower tail of the cumulative distribution. To evaluate in general an
integral of the form

$$P(a < F < b) = \int_a^b h(F)\, dF$$

one may transform the distribution to the beta distribution and use
Karl Pearson's "Tables of the Incomplete Beta Function" (Cambridge
University Press, London, 1932). The required transformation is

$$x = \frac{mF/n}{1 + (mF/n)}$$

which changes (7) to a beta density with parameters $\alpha = (m - 2)/2$
and $\beta = (n - 2)/2$.

10.6. "Student's" t Distribution. Another distribution of considerable
practical importance is that of the ratio of a normally distributed
variate to the square root of a variate independently distributed by
the chi-square distribution. More precisely, if x is normally distributed
with mean $\mu$ and variance $\sigma^2$, if u has the chi-square distribution
with k degrees of freedom, and if x and u are independently distributed,
we seek the distribution of

$$t = \frac{(x - \mu)/\sigma}{\sqrt{u/k}} \tag{1}$$

and letting

$$y = \frac{x - \mu}{\sigma}$$

t becomes $y/\sqrt{u/k}$. The joint density of y and u is

$$f(y, u) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{[(k-2)/2]!\; 2^{k/2}}\, e^{-y^2/2}\, u^{(k-2)/2}\, e^{-u/2} \tag{2}$$

and we find the distribution of t by the same procedure as was used in
the preceding section. We substitute for y in terms of t ($y = t\sqrt{u/k}$)
in (2) and then integrate out u from the resulting function. The final
result is

$$h(t) = \frac{[(k-1)/2]!}{\sqrt{k\pi}\; [(k-2)/2]!}\, \frac{1}{[1 + (t^2/k)]^{(k+1)/2}} \qquad -\infty < t < \infty \tag{3}$$

a distribution with one parameter k, which is also referred to as the
number of degrees of freedom of the distribution. Since $[(x - \mu)/\sigma]^2$
has the chi-square distribution with one degree of freedom, it is evident
from (1) that $t^2$ has the F distribution with one and k degrees of freedom.
The cumulative form of the distribution is partially tabulated
in Table IV.
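The relation between $t^2$ and F noted above can be checked pointwise. In the sketch below (illustrative only; `math.gamma` stands in for the factorials and the function names are ours) the density of $w = t^2$ is obtained from the t density by the usual change of variable, counting both roots $t = \pm\sqrt{w}$, and compared with the F density with one and k degrees of freedom.

```python
import math


def t_density(t, k):
    """Density of "Student's" t with k degrees of freedom."""
    const = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return const * (1 + t * t / k) ** (-(k + 1) / 2)


def f_density(F, m, n):
    """F density with m and n degrees of freedom, for F > 0."""
    const = math.gamma((m + n) / 2) / (math.gamma(m / 2) * math.gamma(n / 2))
    return (const * (m / n) ** (m / 2) * F ** ((m - 2) / 2)
            / (1 + m * F / n) ** ((m + n) / 2))


def density_of_t_squared(w, k):
    """Density of w = t^2; both t = +sqrt(w) and t = -sqrt(w) contribute."""
    root = math.sqrt(w)
    return 2 * t_density(root, k) / (2 * root)   # |dt/dw| = 1 / (2 sqrt(w))
```

For any w > 0, `density_of_t_squared(w, k)` and `f_density(w, 1, k)` should agree to rounding error.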

10.7. Distribution of Sample Means for Binomial and Poisson 
Populations. In the preceding sections we have illustrated the two 
methods of finding distributions of functions of continuous random 
variables described in the first section. Here we shall illustrate 
the technique for discrete variates in two cases of particular interest. 

If $x_1, x_2, \ldots, x_n$ is a sample of size n from the binomial population
which has density

$$f(x) = p^x q^{1-x} \qquad x = 0, 1 \tag{1}$$

the joint density of the x's is simply

$$f(x_1, x_2, \ldots, x_n) = p^{\sum x_i}\, q^{n - \sum x_i} \qquad x_i = 0, 1 \tag{2}$$

The sample mean is

$$\bar x = \frac{1}{n} \sum x_i$$

a function of the random variates, and it is evident that the only possible
values of $\bar x$ are 0, 1/n, 2/n, ..., 1. The probability, g(j/n),
that $\bar x$ takes on the value j/n is obtained by summing (2) over all sets
$(x_1, x_2, \ldots, x_n)$ such that $(1/n)\sum x_i = j/n$, or such that $\sum x_i = j$.
For all such sets, $f(x_1, x_2, \ldots, x_n)$ has the same value $p^j q^{n-j}$; hence
the sum may be evaluated by multiplying this value by the number of
sets $(x_1, x_2, \ldots, x_n)$ with the required specification. The number
of such sets is the number of arrangements of j ones and n - j zeros,
which is $\binom{n}{j}$; hence

$$g\!\left(\frac{j}{n}\right) = \binom{n}{j} p^j q^{n-j} \qquad j = 0, 1, \ldots, n \tag{3}$$

as we have found already in Sec. 7.7.
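The counting argument can be reproduced by brute force for a small sample size. The sketch below is illustrative and uses names of our own choosing: it sums the joint density over every 0/1 sample with a given total j and compares the result with the binomial expression $\binom{n}{j} p^j q^{n-j}$, using exact fractions so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product
from math import comb


def mean_distribution_by_enumeration(n, p):
    """Sum the joint density over all 0/1 samples with the same total j."""
    q = 1 - p
    g = {j: Fraction(0) for j in range(n + 1)}
    for sample in product((0, 1), repeat=n):
        j = sum(sample)
        g[j] += p ** j * q ** (n - j)
    return g          # g[j] is the probability that the sample mean equals j/n


def mean_distribution_closed_form(n, p):
    """Number of arrangements of j ones and n - j zeros, times p^j q^(n-j)."""
    q = 1 - p
    return {j: comb(n, j) * p ** j * q ** (n - j) for j in range(n + 1)}
```

For example, with n = 6 and `p = Fraction(1, 3)` the two dictionaries should be identical, and their values should sum to one.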
In a similar manner we may find the distribution of the mean of a
sample, $x_1, x_2, \ldots, x_n$, from a Poisson population. The joint
density of the observations is

$$f(x_1, x_2, \ldots, x_n) = \frac{e^{-n\mu}\, \mu^{\sum x_i}}{\prod x_i!} \qquad x_i = 0, 1, 2, \ldots \tag{4}$$

using $\mu$ for the parameter of the distribution. The sample mean $\bar x$ can
obviously have any of the values j/n where j = 0, 1, 2, .... For a
particular value j/n, the x's must be such that $\sum x_i = j$; hence

$$g\!\left(\frac{j}{n}\right) = \sum_{\sum x_i = j} \frac{e^{-n\mu}\, \mu^{\sum x_i}}{\prod x_i!} = e^{-n\mu} \mu^j \sum_{\sum x_i = j} \frac{1}{\prod x_i!} \tag{5}$$

The sum can be performed with the aid of the multinomial theorem
which, on putting all $x_i = 1$ in equation (2.5.2), states that

$$\sum_{\sum x_i = j} \frac{j!}{\prod x_i!} = n^j$$

The sum is therefore $n^j/j!$, and the required density is

$$g\!\left(\frac{j}{n}\right) = \frac{e^{-n\mu} (n\mu)^j}{j!} \qquad j = 0, 1, 2, \ldots \tag{6}$$

The function may be written explicitly as a function of $\bar x$:

$$g(\bar x) = \frac{e^{-n\mu} (n\mu)^{n\bar x}}{(n\bar x)!} \qquad \bar x = 0, \frac{1}{n}, \frac{2}{n}, \ldots \tag{7}$$

We may notice that since there is a unique correspondence between
$\bar x = j/n$ and $j = \sum x_i$, the density of j is

$$g(j) = \frac{e^{-n\mu} (n\mu)^j}{j!} \qquad j = 0, 1, 2, \ldots$$

and hence that the sum of n observations from a Poisson population
has a Poisson distribution with the parameter equal to n times the
parameter of the original distribution.
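The same conclusion can be reached numerically without the multinomial theorem, by convolving the Poisson probabilities directly. The following sketch is illustrative only and its function names are ours; it keeps the probabilities of the totals 0, 1, ..., size - 1, for which repeated convolution is exact apart from rounding, since a total of j can arise only from observations not exceeding j.

```python
import math


def poisson_pmf(mu, size):
    """Poisson probabilities for the values 0, 1, ..., size - 1."""
    return [math.exp(-mu) * mu ** x / math.factorial(x) for x in range(size)]


def convolve(a, b):
    """Distribution of the sum of two independent variates, kept to len(a) values."""
    out = [0.0] * len(a)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < len(out):
                out[i + j] += ai * bj
    return out


def distribution_of_sum(mu, n, size=30):
    """Probabilities that x_1 + ... + x_n = j for j < size."""
    single = poisson_pmf(mu, size)
    total = single
    for _ in range(n - 1):
        total = convolve(total, single)
    return total
```

With parameter 0.8 and n = 5, `distribution_of_sum(0.8, 5)` should agree term by term with `poisson_pmf(4.0, 30)`.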

10.8. Large-sample Distribution of Maximum-likelihood Estima- 
tors. We have investigated several special problems in sampling 
theory not only to illustrate the methods of finding sampling distribu- 
tions, but because the particular distributions we have obtained are 
important in applied statistics. They are sometimes referred to as 
“small-sample distributions,” though of course they hold for large or 
small samples and the term is merely meant to indicate that they are 
valid for small samples. In this section we shall consider a distribu- 
tion much more general, in the sense that it is more or less independent 
of the form of the population distribution, but valid only for large 
samples. 

We shall first consider the case of one parameter $\theta$ in a density
$f(x; \theta)$, and we shall show that the maximum-likelihood estimator
$\hat\theta(x_1, x_2, \ldots, x_n)$ for $\theta$ from samples of size n is approximately
normally distributed under rather general conditions where n is large.
Before doing so, it is necessary to consider the variate

$$u(x) = \frac{\partial}{\partial\theta} \log f(x; \theta) \tag{1}$$

The expected value of u is

$$E(u) = \int_{-\infty}^{\infty} \left[\frac{\partial}{\partial\theta} \log f(x; \theta)\right] f(x; \theta)\, dx \tag{2}$$

$$= \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x; \theta)\, dx \tag{3}$$

If $f(x; \theta)$ is such that the operations of differentiation and integration
may be interchanged, then

$$E(u) = \frac{\partial}{\partial\theta} \int_{-\infty}^{\infty} f(x; \theta)\, dx = \frac{\partial}{\partial\theta}(1) = 0 \tag{4}$$

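Hence, if this condition is satisfied, the variance of u is

$$\sigma_u^2 = \int \left[\frac{\partial}{\partial\theta} \log f(x; \theta)\right]^2 f(x; \theta)\, dx \tag{5}$$

$$= \int \frac{1}{f(x; \theta)} \left[\frac{\partial f(x; \theta)}{\partial\theta}\right]^2 dx \tag{6}$$

and this may be put in another form which is more useful for our
purpose. On differentiating (2) with respect to $\theta$, we have

$$0 = \frac{\partial}{\partial\theta} \int \left[\frac{\partial}{\partial\theta} \log f(x; \theta)\right] f(x; \theta)\, dx$$

$$= \int \left[\frac{\partial^2}{\partial\theta^2} \log f(x; \theta)\right] f(x; \theta)\, dx + \int \left[\frac{\partial}{\partial\theta} \log f(x; \theta)\right] \frac{\partial f(x; \theta)}{\partial\theta}\, dx$$

$$= \int \left[\frac{\partial^2}{\partial\theta^2} \log f(x; \theta)\right] f(x; \theta)\, dx + \int \frac{1}{f(x; \theta)} \left[\frac{\partial f(x; \theta)}{\partial\theta}\right]^2 dx \tag{7}$$

The integral in (6) is therefore minus the first integral in (7), and we
may write

$$\sigma_u^2 = -E\left[\frac{\partial^2}{\partial\theta^2} \log f(x; \theta)\right] \tag{8}$$

Now suppose a sample of size n is drawn from the population. This
will give rise to a sample of u values:

$$u_i = \frac{\partial}{\partial\theta} \log f(x_i; \theta) \qquad i = 1, 2, \ldots, n \tag{9}$$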

Applying the central-limit theorem (Sec. 7.6) to the sample of u's, we
may state that

$$\bar u = \frac{1}{n} \sum u_i$$

is approximately normally distributed for large n with zero mean and
variance $\sigma_u^2/n$. Remembering that the likelihood of the sample of
x's is $\prod f(x_i; \theta)$ and that its logarithm (Sec. 8.4) is

$$L = \sum \log f(x_i; \theta) \tag{10}$$

we have

$$\bar u = \frac{1}{n} \frac{\partial L}{\partial\theta} \tag{11}$$

Hence it follows that $\partial L/\partial\theta$ is approximately normally distributed for
large n with mean zero and variance $n\sigma_u^2$.

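This last result enables us to find the distribution of the estimator $\hat\theta$.
We shall suppose that $\hat\theta$ is a root of

$$\frac{\partial L}{\partial\theta} = 0 \tag{12}$$

i.e., that L actually has zero slope at its maximum value. And we
shall suppose that $\partial L(\hat\theta)/\partial\hat\theta$ as a function of $\hat\theta$ may be expanded in a
Taylor series about $\theta$:

$$\frac{\partial L(\hat\theta)}{\partial\hat\theta} = \frac{\partial L(\theta)}{\partial\theta} + (\hat\theta - \theta) \frac{\partial^2 L(\theta)}{\partial\theta^2} + \frac{(\hat\theta - \theta)^2}{2!} \frac{\partial^3 L(\tilde\theta)}{\partial\theta^3} \tag{13}$$

where $\tilde\theta$ is some point between $\theta$ and $\hat\theta$. Since $\hat\theta$ is a root of $\partial L(\theta)/\partial\theta$,
(13) vanishes, and we have

$$\frac{\partial L(\theta)}{\partial\theta} = -\frac{\partial^2 L(\theta)}{\partial\theta^2} (\hat\theta - \theta) - \frac{1}{2!} \frac{\partial^3 L(\tilde\theta)}{\partial\theta^3} (\hat\theta - \theta)^2 \tag{14}$$

Now we have seen that

$$\frac{1}{\sqrt{n}\, \sigma_u} \frac{\partial L(\theta)}{\partial\theta}$$

is approximately normally distributed for large n with zero mean and
unit variance. Using (14), this expression is

$$\frac{1}{\sqrt{n}\, \sigma_u} \frac{\partial L(\theta)}{\partial\theta} = -\frac{1}{\sqrt{n}\, \sigma_u} \frac{\partial^2 L(\theta)}{\partial\theta^2} (\hat\theta - \theta) - \frac{1}{\sqrt{n}\, \sigma_u} \frac{1}{2!} \frac{\partial^3 L(\tilde\theta)}{\partial\theta^3} (\hat\theta - \theta)^2 \tag{15}$$

and on the right we shall substitute $w = \sqrt{n}\, \sigma_u (\hat\theta - \theta)$ to get

$$\frac{1}{\sqrt{n}\, \sigma_u} \frac{\partial L(\theta)}{\partial\theta} = -\frac{w}{\sigma_u^2} \left[\frac{1}{n} \frac{\partial^2 L(\theta)}{\partial\theta^2}\right] - \frac{w^2}{\sqrt{n}\, \sigma_u^3} \left[\frac{1}{n} \frac{1}{2!} \frac{\partial^3 L(\tilde\theta)}{\partial\theta^3}\right] \tag{16}$$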


The first bracket on the right of (16) is simply an average for samples
of size n of $\frac{\partial^2}{\partial\theta^2} \log f(x; \theta)$ and by virtue of (8) has a mean value $-\sigma_u^2$.
Furthermore, if this quantity has a finite variance, $(1/n)(\partial^2 L/\partial\theta^2)$ will
approach $-\sigma_u^2$ with probability approaching one as n becomes infinite.
The first term of (16) is therefore nearly w for large n. The second
term of (16) approaches zero because of the factor $1/\sqrt{n}$ if we assume
that the average of the third derivative of $\log f(x; \theta)$ cannot become
infinite for any possible value of $\theta$. The right of (16) is therefore
approximately w, and since the left of (16) is approximately normal
with zero mean and unit variance, it follows that w has approximately
the same distribution. We have finally that $\hat\theta$ is, for large samples,
approximately normally distributed with mean $\theta$ (the true parameter
value) and variance $1/n\sigma_u^2$, where $\sigma_u^2$ is defined by (8). The mean $\theta$
will be the exact mean of $\hat\theta$ for any sample size only if $\hat\theta$ is an unbiased
estimator. In general, we have seen that maximum-likelihood estimators
are not unbiased so that $\theta$ is the large-sample mean, i.e., the
value approached by the mean as n becomes large. Similarly, $1/n\sigma_u^2$
may be the exact variance, or it may be only the limiting form of the
exact variance as n becomes large, the large-sample variance. One
could, of course, compute the variance of $\hat\theta$ directly by

$$E[\hat\theta - E(\hat\theta)]^2 = \int \cdots \int [\hat\theta - E(\hat\theta)]^2 \prod f(x_i; \theta)\, dx_i$$

rather than by means of equation (8), but this is usually the more
difficult computation.
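To make the quantities in this argument concrete, consider the two-point density $f(x; p) = p^x q^{1-x}$, x = 0, 1, used in Sec. 10.7. The sketch below is an illustration of ours, not part of the original text: it evaluates E(u), the variance of u as the expected square of the first derivative, and the same variance as minus the expected second derivative, all exactly, and then forms the large-sample variance $1/n\sigma_u^2$. For this density the result is pq/n, which is also the exact variance of the sample proportion.

```python
from fractions import Fraction


def bernoulli_information(p):
    """Return (E[u], E[u^2], -E[d^2 log f / dp^2]) for f(x; p) = p^x q^(1-x)."""
    q = 1 - p
    mean_u = second_moment = minus_expected_curvature = Fraction(0)
    for x, prob in ((0, q), (1, p)):
        u = Fraction(x) / p - Fraction(1 - x) / q                      # d log f / dp
        curvature = -Fraction(x) / p ** 2 - Fraction(1 - x) / q ** 2   # d^2 log f / dp^2
        mean_u += u * prob
        second_moment += u * u * prob
        minus_expected_curvature -= curvature * prob
    return mean_u, second_moment, minus_expected_curvature


def large_sample_variance(p, n):
    """1 / (n sigma_u^2), the variance in the limiting normal distribution."""
    return 1 / (n * bernoulli_information(p)[2])
```

With `p = Fraction(1, 4)` the first returned value is zero and the other two are both 16/3, so the two expressions for $\sigma_u^2$ agree; `large_sample_variance(Fraction(1, 4), 10)` is then 3/160, that is, pq/n.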

The above argument is not, of course, a proof of the asymptotic
normality of $\hat\theta$; we have merely outlined the nature of the proof. A
rigorous demonstration requires careful evaluation of the errors in the
various approximations. While the maximum-likelihood estimator is
approximately normally distributed for large samples under rather
general conditions, it is to be remarked that several conditions on the
original distribution must be fulfilled:

(1) It must be permissible to interchange the operations of integration
with respect to x and differentiation with respect to $\theta$.

(2) The expected value of $\frac{\partial}{\partial\theta} \log f(x; \theta)$ must be zero.

(3) $\frac{\partial^2}{\partial\theta^2} \log f(x; \theta)$ must have finite mean and variance.

(4) $\frac{1}{n} \frac{\partial^3}{\partial\theta^3} L(\theta)$ must remain bounded for all possible values of $\theta$.

(5) The derivative of $L(\theta)$ must vanish at its maximum.

These conditions will not be fulfilled, for example, if the parameter is
the range or a function of the range, for then (1) is not satisfied. We
have seen in particular that if $\theta$ is the range of a rectangular distribution,
condition (5) is not fulfilled.

For a wide class of distributions, however, the maximum-likelihood 
estimator is approximately normally distributed about the true param- 
eter value as a mean for large samples. This is a powerful tool for 
solving many important problems of applied statistics as we shall see 
in the following chapters. The theorem is applicable to discrete as 
well as to continuous distributions. The only change in the reasoning 
for discrete distributions would be replacement of the integral signs
by summation signs. 

A straightforward extension of the argument will provide an analogous
result for the large-sample distribution of several parameters.
We shall merely state the result:

The maximum-likelihood estimators $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k$ for the parameters
of a density $f(x; \theta_1, \theta_2, \ldots, \theta_k)$ from samples of size n are, for large
samples, approximately distributed by the multivariate normal distribution
with means $\theta_1, \theta_2, \ldots, \theta_k$ and with coefficients $\|n\sigma^{ij}\|$ in the quadratic
form, where

$$\sigma^{ij} = -E\left[\frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(x; \theta_1, \theta_2, \ldots, \theta_k)\right] \tag{17}$$

The variances and covariances of the estimators are $\|(1/n)\sigma_{ij}\|$, where

$$\|\sigma_{ij}\| = \|\sigma^{ij}\|^{-1} \tag{18}$$

The conditions under which this theorem is true are essentially the
same as those given in the case of one parameter.

The theorems obviously depend in no way on the fact that we have
used univariate distributions. The variate x in all the statements of
this section may be replaced by a set of variates (x, y, z, ...).

10.9. Applications of the Large-sample Theory. To illustrate the 
use of the theorem just given, we may find the large-sample distribu- 
tion for the estimators of the two parameters of the normal distribu- 
tion. We shall write it in the form 


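$$f(x; \theta_1, \theta_2) = \frac{1}{\sqrt{2\pi\theta_2}}\, e^{-(1/2\theta_2)(x - \theta_1)^2} \tag{1}$$

For samples of size n we have seen that the maximum-likelihood
estimators are

$$\hat\theta_1 = \frac{1}{n} \sum x_i \tag{2}$$

$$\hat\theta_2 = \frac{1}{n} \sum (x_i - \hat\theta_1)^2 \tag{3}$$

In accordance with the theorem, these estimators will be approximately
normally distributed for large samples with means $\theta_1$ and $\theta_2$
and coefficients $n\sigma^{ij}$ in the quadratic form, where

$$\sigma^{ij} = -E\left(\frac{\partial^2 \log f}{\partial\theta_i\, \partial\theta_j}\right) \tag{4}$$

Since

$$\log f = -\frac12 \log 2\pi - \frac12 \log \theta_2 - \frac{1}{2\theta_2} (x - \theta_1)^2 \tag{5}$$

the required derivatives are

$$\frac{\partial^2 \log f}{\partial\theta_1^2} = -\frac{1}{\theta_2}$$

$$\frac{\partial^2 \log f}{\partial\theta_1\, \partial\theta_2} = -\frac{x - \theta_1}{\theta_2^2}$$

$$\frac{\partial^2 \log f}{\partial\theta_2^2} = \frac{1}{2\theta_2^2} - \frac{(x - \theta_1)^2}{\theta_2^3}$$

and because

$$E(x) = \theta_1 \qquad E(x - \theta_1)^2 = \theta_2$$

the $\sigma^{ij}$ are readily seen to be

$$\|\sigma^{ij}\| = \begin{Vmatrix} \dfrac{1}{\theta_2} & 0 \\ 0 & \dfrac{1}{2\theta_2^2} \end{Vmatrix} \tag{6}$$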
The large-sample distribution of the estimators is, therefore, say,

$$g(\hat\theta_1, \hat\theta_2) = \frac{n}{2\pi \sqrt{2}\, \theta_2^{3/2}}\, e^{-\frac{n}{2}\left[\frac{(\hat\theta_1 - \theta_1)^2}{\theta_2} + \frac{(\hat\theta_2 - \theta_2)^2}{2\theta_2^2}\right]} \tag{7}$$

with large-sample variances and covariances given by

$$\left\|\frac{1}{n} \sigma_{ij}\right\| = \begin{Vmatrix} \dfrac{\theta_2}{n} & 0 \\ 0 & \dfrac{2\theta_2^2}{n} \end{Vmatrix} \tag{8}$$

Since $\sigma_{12} = 0$, the estimators are shown to be independently distributed
for large samples; we have already seen, of course, in Sec. 4 that they
are actually independent for any sample size. The large-sample distribution
of $\hat\theta_1$ is exactly the normal distribution as given in (7). But
the exact distribution of $\hat\theta_2$ is given by the gamma distribution for any
sample size and this appears to conflict with the normal distribution
indicated in (7). However, it can be shown that the exact distribution
of $\hat\theta_2$ does approach the normal form

$$\frac{\sqrt{n}}{\sqrt{2\pi}\, \sqrt{2}\, \theta_2}\, e^{-\frac{n}{4\theta_2^2} (\hat\theta_2 - \theta_2)^2}$$

as n becomes large (see Prob. 38, Chap. 6).

As a second illustration, we shall obtain the large-sample distribution 
of the estimators of the parameters of a multinomial distribution. 

Suppose the elements of a population may be classified into k + 1
categories, say $A_1, A_2, \ldots, A_{k+1}$. We shall describe an element by
the set of variables $(x_1, x_2, \ldots, x_{k+1})$ where, if the element belongs
to $A_i$, $x_i = 1$ and all the other x's are zero. If the probability is $p_i$
that an element drawn at random belongs to $A_i$, then the joint density
of the x's is

$$f(x_1, x_2, \ldots, x_{k+1}) = p_1^{x_1} p_2^{x_2} \cdots p_{k+1}^{x_{k+1}} \qquad x_i = 0, 1;\ \sum x_i = 1 \tag{9}$$

where $\sum p_i = 1$. Summing $f(x_1, \ldots, x_{k+1})$ over all possible sets of
x's, namely, (1, 0, 0, ..., 0), (0, 1, 0, 0, ..., 0), (0, 0, 1, 0, ..., 0),
and so on, we have

$$\sum f(x_1, x_2, \ldots, x_{k+1}) = \sum_{i=1}^{k+1} p_i = 1$$

The distribution (9) is a multivariate distribution with k functionally
independent parameters; we shall take them to be $p_1, p_2, \ldots, p_k$ and
think of $p_{k+1}$ as a symbol for $1 - p_1 - p_2 - \cdots - p_k$.

Let a sample of size n be drawn, and let $n_i$ be the number of sample
elements in $A_i$; then $\sum n_i = n$ and the likelihood of the sample is

$$\prod_{i=1}^{k+1} p_i^{n_i}$$

the logarithm of which is

$$L(p_1, p_2, \ldots, p_k) = \sum_{i=1}^{k+1} n_i \log p_i \tag{10}$$


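The estimators are found by putting the first derivatives of L equal to
zero and solving for the parameters. The equations are

$$\frac{\partial L}{\partial p_1} = \frac{n_1}{p_1} - \frac{n_{k+1}}{p_{k+1}} = 0$$

$$\frac{\partial L}{\partial p_2} = \frac{n_2}{p_2} - \frac{n_{k+1}}{p_{k+1}} = 0 \tag{11}$$

$$\cdots$$

$$\frac{\partial L}{\partial p_k} = \frac{n_k}{p_k} - \frac{n_{k+1}}{p_{k+1}} = 0$$

remembering that $p_{k+1}$ represents $1 - p_1 - p_2 - \cdots - p_k$.
On multiplying the first equation by $p_1 p_{k+1}$, the second by
$p_2 p_{k+1}$, and so on, and adding the results, one finds $\hat p_{k+1} = n_{k+1}/n$, and
then that

$$\hat p_i = \frac{n_i}{n} \qquad i = 1, 2, \ldots, k \tag{12}$$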


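We wish to find the approximate distribution of the estimators in
(12) for large samples. Applying the theorem of the preceding section,
we know that the distribution is normal and that the means are $p_i$.
We need only to find the coefficients $n\sigma^{ij}$ of the quadratic form. By
equation (8.17)

$$\sigma^{ij} = -E\left(\frac{\partial^2}{\partial p_i\, \partial p_j} \log f\right) \tag{13}$$

Differentiating log f, we have

$$\frac{\partial^2}{\partial p_i\, \partial p_j} \log f = \begin{cases} -\dfrac{x_{k+1}}{p_{k+1}^2} & \text{if } i \neq j \\[2ex] -\dfrac{x_i}{p_i^2} - \dfrac{x_{k+1}}{p_{k+1}^2} & \text{if } i = j \end{cases} \tag{14}$$

and taking expected values,

$$E(x_i) = \sum x_i \prod_{1}^{k+1} p_j^{x_j} = p_i \tag{15}$$

$$E(x_{k+1}) = \sum x_{k+1} \prod_{1}^{k+1} p_j^{x_j} = p_{k+1}$$

Thus

$$\sigma^{ij} = \begin{cases} \dfrac{1}{p_{k+1}} & \text{if } i \neq j \\[2ex] \dfrac{1}{p_i} + \dfrac{1}{p_{k+1}} & \text{if } i = j \end{cases} \tag{16}$$

and we may write these two relations as one using the symbol $\delta_{ij}$:

$$\sigma^{ij} = \frac{\delta_{ij}}{p_i} + \frac{1}{p_{k+1}} \qquad i, j = 1, 2, \ldots, k \tag{17}$$

The value of the determinant $|\sigma^{ij}|$ can be shown to be $1 / \prod_{1}^{k+1} p_i$; hence
the approximate large-sample distribution of the estimators is, say,

$$g(\hat p_1, \hat p_2, \ldots, \hat p_k) = \left(\frac{n}{2\pi}\right)^{k/2} \left(\frac{1}{\prod_{1}^{k+1} p_i}\right)^{1/2} e^{-\frac{n}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \sigma^{ij} (\hat p_i - p_i)(\hat p_j - p_j)} \tag{18}$$

The inverse of $\|\sigma^{ij}\|$ has elements

$$\sigma_{ii} = p_i (1 - p_i) \qquad i = 1, 2, \ldots, k$$
$$\sigma_{ij} = -p_i p_j \qquad i \neq j;\ i, j = 1, 2, \ldots, k \tag{19}$$

as may be verified by computing the product $\|\sigma^{ij}\| \cdot \|\sigma_{ij}\|$. The
large-sample variances and covariances of the estimators are therefore given
by multiplying (19) by 1/n. These happen to be, in fact, the exact
variances and covariances for any sample size.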
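The verification suggested in the text, forming the product of the two arrays, is easily carried out for particular probabilities. The sketch below is illustrative, with function names of our own: it builds the coefficient array with elements $\delta_{ij}/p_i + 1/p_{k+1}$ and the proposed inverse with elements $p_i(1 - p_i)$ and $-p_i p_j$, using exact fractions, and multiplies them.

```python
from fractions import Fraction


def information_matrix(p):
    """Elements delta_ij / p_i + 1 / p_(k+1); p lists all k + 1 probabilities."""
    k = len(p) - 1
    last = p[k]
    return [[(1 / p[i] if i == j else 0) + 1 / last for j in range(k)]
            for i in range(k)]


def covariance_matrix(p):
    """Elements p_i (1 - p_i) on the diagonal and -p_i p_j elsewhere."""
    k = len(p) - 1
    return [[p[i] * (1 - p[i]) if i == j else -p[i] * p[j] for j in range(k)]
            for i in range(k)]


def matmul(a, b):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(a[i][s] * b[s][j] for s in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

For probabilities 1/2, 1/3, 1/6 (so k = 2), `matmul(information_matrix(p), covariance_matrix(p))` should be the 2 by 2 identity.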


10.10. Problems 


1. Apply the method of equation (1.4) to the example treated in
equations (1.14) to (1.17). 

2. If z is distributed by f(x) = 2,0 < x < 1, find the distribution 
of u = (8a — 1)*. 

3. If x is distributed by f(x) = 1, 0 < x < 1, find the distribution
of $\bar x$ for samples $x_1, x_2$ of size two. Observe that the range of $x_2$ for
fixed $\bar x$ is $0 < x_2 < 2\bar x$ when $\bar x < \tfrac12$, and $2\bar x - 1 < x_2 < 1$ when
$\bar x > \tfrac12$.

4. If x is normally distributed with mean $\mu$ and variance $\sigma^2$, show
by transforming the variate that $u = [(x - \mu)/\sigma]^2$ has the chi-square
distribution with one degree of freedom.

5. Obtain the distribution of the mean of a sample of size n from a 
normal population by using the moment generating function. 

6. If $x_1^2, x_2^2, x_3^2, \ldots, x_r^2$ are independently distributed by chi-square
laws with $n_1, n_2, \ldots, n_r$ degrees of freedom, respectively,
show by means of the moment generating function that $u = \sum x_i^2$ has
the chi-square distribution with $n = \sum n_i$ degrees of freedom.

7. Using an argument similar to that given for the derivation of the
chi-square distribution and the fact that $|(1 - 2t)\sigma^{ij}| = (1 - 2t)^k |\sigma^{ij}|$,
show that the quadratic form of a k-variate normal distribution has the
chi-square distribution with k degrees of freedom.

8. Find the mean and variance of a chi-square variate with k
degrees of freedom.
9. Use the integral of the F distribution over the whole range to 
obtain an identity in the parameters m and n, and then use the identity 
to obtain the mean and variance of F. 
10. Find the .95 probability level of F for two and four degrees of 
freedom by direct integration of the distribution function. 
11. Show that the transformation

$$x = \frac{mF/n}{1 + (mF/n)}$$

changes the F distribution to the beta distribution.

12. Show, by transforming the variate in the t distribution, that
$u = t^2$ has the F distribution.

13. If $x_1, x_2, \ldots, x_n$ is a random sample from a normal population,
show that

$$\frac{\bar x - \mu}{\sqrt{\dfrac{\sum (x_i - \bar x)^2}{n(n - 1)}}}$$

has the t distribution with n - 1 degrees of freedom.

14. If $x_1$ and $x_2$ are a random sample of two from a population with
$f(x) = e^{-x}$, x > 0, show that $u = x_1 + x_2$ and $v = x_1/x_2$ are
independently distributed.

15. If x, y, z have the joint density

$$f(x, y, z) = \frac{6}{(1 + x + y + z)^4} \qquad x, y, z > 0$$

find the distribution of u = x + y + z.

16. If $x_1$ and $x_2$ are a random sample of two from a population with
the uniform distribution over the unit interval, find the distribution
of $u = x_1 x_2$.

17. If x and y have the bivariate normal distribution, show that

$$u = \frac{x - \mu_x}{\sigma_x} + \frac{y - \mu_y}{\sigma_y}$$

and

$$v = \frac{x - \mu_x}{\sigma_x} - \frac{y - \mu_y}{\sigma_y}$$

are independently normally distributed with zero means and variances
$2(1 + \rho)$ and $2(1 - \rho)$.

18. If x and y are independently normally distributed with zero
means and unit variances, show that $u = x^2 + y^2$ and v = x/y are
independently distributed. What are the names of the individual
distributions of u and v?

19. Show that “Student’s” distribution approaches the normal 
form when the number of degrees of freedom becomes infinite. 

20. If $x_1, x_2, \ldots, x_n$ are a random sample from a normal population,
find the joint distribution of

$$u = \sum_{1}^{k} x_i \qquad \text{and} \qquad v = \sum_{r}^{n} x_i \qquad 0 < r < k < n$$

21. If x and y are independently distributed by chi-square laws
with m and n degrees of freedom, respectively, show that u = x + y
and v = x/y are independently distributed.

22. Consider samples of size n from a bivariate normal distribution. 
Using the notation of Sec. 9.7, show that 


vn —1(&. — fs — f+ &) 
Vê + êz — 12 


has "Student's" distribution with n — 1 degrees of freedom. 

23. If x and y are horizontal and vertical components of the deviations
of a shot from the center of a target, and if x and y have a bivariate
normal distribution with zero means, $\rho = 0.1$, and standard deviations
of 10 inches, find the equation of an ellipse which will contain a
shot with probability .95. (Use the result of Prob. 7.)

24. Find the mean and variance of $(1/n)\sum (x_i - \bar x)^2$ for samples of
size n from a normal population, and show that they approach the
large-sample mean and variance, $\sigma^2$ and $2\sigma^4/n$, as n increases.

25. If $x_1, x_2, \ldots, x_k$ are independently and normally distributed
with means $\mu_i$ and variance $\sigma_i^2$, show that

$$u = \sum_{1}^{k} a_i x_i$$

where the $a_i$ are constants, is normally distributed with mean $\sum a_i \mu_i$
and variance $\sum a_i^2 \sigma_i^2$. Then deduce the distribution of the sample mean
from a normal population by putting $a_i = 1/k$.

26. Obtain a result similar to that of Prob. 25 when the $x_i$ have the
multivariate normal distribution.

27. Find the large-sample distribution for the estimator of the
parameter $\beta$ in the gamma distribution.

28. Find the large-sample distribution for the estimator of the
parameter of the Poisson distribution.

29. If $(x_{1s}, x_{2s}, \ldots, x_{ks})$, s = 1, 2, ..., n is a sample of size n
from the multinomial population with density

$$\prod_{1}^{k} p_i^{x_i} \qquad x_i = 0, 1;\ \sum x_i = 1;\ \sum p_i = 1$$

find the distribution of the variates $n_i = \sum_{s} x_{is}$, and find their variances
and covariances.

30. Verify that $\|\sigma_{ij}\|$ defined in equation (9.19) is the inverse of
$\|\sigma^{ij}\|$ given by equation (9.17).

31. Evaluate the determinant of $\|\sigma_{ij}\|$ in Prob. 30.


32. If $x_1, x_2, \ldots, x_n$ are independently normally distributed with
the same mean but different variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, show that
$u = \dfrac{\sum (x_i/\sigma_i^2)}{\sum (1/\sigma_i^2)}$ and $v = \sum (x_i - u)^2/\sigma_i^2$ are independently distributed. Show
also that u is normal, while v has the chi-square distribution with
n - 1 degrees of freedom.

33. Let $s^2$ denote $\sum (x_i - \bar x)^2/(n - 1)$, the mean square for samples
of size n. For three samples from normal populations (with variances
$\sigma_1^2$, $\sigma_2^2$, and $\sigma_3^2$), the sample sizes being $n_1$, $n_2$, and $n_3$, find the joint
density of

$$u = \frac{s_1^2}{s_3^2} \qquad \text{and} \qquad v = \frac{s_2^2}{s_3^2}$$

where the $s_1^2$, $s_2^2$, and $s_3^2$ are the sample mean squares.

34. Let a sample of size $n_1$ from a normal population (with variance
$\sigma_1^2$) have mean square $s_1^2$, and let a second sample of size $n_2$ from a
second normal population (with mean $\mu_2$ and variance $\sigma_2^2$) have mean
$\bar x$ and mean square $s_2^2$. Find the joint density of

$$u = \frac{\sqrt{n_2}\, (\bar x - \mu_2)}{s_2} \qquad \text{and} \qquad v = \frac{s_1^2}{s_2^2}$$


CHAPTER 11 
INTERVAL ESTIMATION 


11.1. Confidence Intervals. A point estimate of a parameter is not
very meaningful without some measure of the possible error in the
estimate. An estimate $\hat\theta$ of a parameter $\theta$ should be accompanied
by some interval about $\hat\theta$, possibly of the form $\hat\theta - d$ to $\hat\theta + d$, together
with some measure of assurance that the true parameter $\theta$ does lie
within the interval. Estimates are often given in such form. Thus
the electronic charge may be estimated to be $(4.770 \pm .005)10^{-10}$
electrostatic unit with the idea that the first factor is very unlikely
to be outside the range 4.765 to 4.775. A cost accountant for a publishing
company in trying to allow for all factors which enter into the
cost of producing a certain book (actual production costs, proportion
of plant overhead, proportion of executive salaries, etc.) may estimate
the cost to be 83 ± 4.5 cents per volume with the implication that the
correct cost very probably lies between 78.5 and 87.5 cents per volume.
The Bureau of Labor Statistics may estimate the number of unemployed
to be 2.4 ± .3 millions at a given time, feeling rather sure that
the actual number is between 2.1 and 2.7 millions.

In order to give precision to these ideas, we shall consider a particular example. Suppose a sample (1.2, 3.4, 0.6, 5.6) of four observations is drawn from a normal population with unknown mean $\mu$ and known standard deviation 3. The maximum-likelihood estimate of $\mu$ is the mean of the sample observations:

$$\bar{x} = 2.7 \tag{1}$$

We wish to determine upper and lower limits which are rather certain to contain the true parameter value between them.

In general, for samples of size four from the given distribution, the quantity

$$y = \frac{\bar{x} - \mu}{3/2} \tag{2}$$

will be normally distributed with zero mean and unit variance. $\bar{x}$ is the sample mean, and $3/2$ is $\sigma/\sqrt{n}$. Thus the quantity $y$ has a density

$$f(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} \tag{3}$$

which is independent of the true value of the unknown parameter, and we can compute the probability that $y$ will be between any two arbitrarily chosen numbers. Thus, for example,

$$P(-1.96 < y < 1.96) = \int_{-1.96}^{1.96} f(y)\,dy = .95 \tag{4}$$
In this relation the inequality $-1.96 < y$, or

$$-1.96 < \frac{\bar{x} - \mu}{3/2}$$

is equivalent to the inequality

$$\mu < \bar{x} + \tfrac{3}{2}(1.96) = \bar{x} + 2.94$$

and the inequality

$$y < 1.96$$

is equivalent to

$$\mu > \bar{x} - 2.94$$

We may therefore rewrite (4) in the form

$$P(\bar{x} - 2.94 < \mu < \bar{x} + 2.94) = .95 \tag{5}$$

and substituting 2.7 for $\bar{x}$,

$$P(-.24 < \mu < 5.64) = .95 \tag{6}$$

Thus two limits have been obtained $(-.24, 5.64)$, which we may say are 95 per cent certain to contain the true parameter value between them.
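As a minimal sketch, assuming Python's standard library is at hand, the arithmetic leading to (6) may be reproduced as follows. The function name `normal_mean_interval` is illustrative only, and the unit-normal level is taken from `statistics.NormalDist` instead of a printed table, so the limits agree with (6) only after rounding.

```python
from math import sqrt
from statistics import NormalDist, mean

def normal_mean_interval(sample, sigma, confidence=0.95):
    """Symmetric confidence interval for the mean when sigma is known."""
    n = len(sample)
    xbar = mean(sample)
    # d cuts off (1 - confidence)/2 in each tail of the unit normal density
    d = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = d * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# The sample of the text, with known standard deviation 3
lower, upper = normal_mean_interval([1.2, 3.4, 0.6, 5.6], sigma=3.0)
print(round(lower, 2), round(upper, 2))
```

Passing `confidence=0.99` gives the wider interval that corresponds to the level 2.58.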
The meaning of (6) needs to be examined carefully. It appears that $\mu$ is the variable and that the statement implies that the probability that the variable $\mu$ lies between $-.24$ and 5.64 is .95. This is, of course, nonsense. $\mu$ is a fixed number, the mean of the population sampled. Furthermore the true mean $\mu$ either does or does not lie between $-.24$ and 5.64. The only correct probability statements possible in this situation are

$$P(-.24 < \mu < 5.64) = 1$$

if $\mu$ actually is between the numbers, or

$$P(-.24 < \mu < 5.64) = 0$$

if $\mu$ is not between the numbers. It is possible, however, to give (6) a meaningful interpretation.

The statement in equation (5) does have meaning. The probability that the random interval, $\bar{x} - 2.94$ to $\bar{x} + 2.94$, covers the true mean $\mu$ is .95. That is, if samples of four were repeatedly drawn from the population, and if the random interval $\bar{x} - 2.94$ to $\bar{x} + 2.94$ were computed for each sample, then 95 per cent of those intervals would be expected to contain the true mean $\mu$. We do therefore have considerable confidence that the interval $-.24$ to 5.64 does cover the true mean. The measure of our confidence is .95 because before the sample was drawn, .95 was the probability that the interval we were going to construct would cover the true mean. In (5) the number .95 is a true probability; in (6) it is not a true probability although it is a measure of our confidence in the truth of the statement on the left of (6). We shall call it the confidence coefficient, or the fiducial probability, to distinguish it from our ordinary concept of probability. And we shall rewrite (6) as

$$P_F(-.24 < \mu < 5.64) = .95 \tag{7}$$

and read it "The fiducial probability that the interval $-.24$ to 5.64 covers the true mean is .95." The word fiducial indicates nothing more than that the probability associated with the given interval was .95 before the sample was drawn.
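The repeated-sampling reading of (5) can be illustrated by a small simulation. The sketch below is only an illustration under assumed settings: the true mean 2.0, the number of trials, and the seed are arbitrary choices of ours, and `coverage` is a hypothetical helper name.

```python
import random
from math import sqrt

def coverage(mu, sigma=3.0, n=4, d=1.96, trials=20000, seed=1):
    """Fraction of random intervals xbar -/+ d*sigma/sqrt(n) that cover mu."""
    rng = random.Random(seed)
    half_width = d * sigma / sqrt(n)      # 2.94 for sigma = 3, n = 4
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / trials

rate = coverage(mu=2.0)
print(rate)   # should lie close to .95
```

The observed fraction fluctuates from run to run with the seed, but it does not depend on which value of the true mean is assumed.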

The interval $-.24$ to 5.64 is called a confidence interval; more specifically it is called a 95 per cent confidence interval, the confidence coefficient, or fiducial probability, being expressed as a percentage. We can obtain intervals with any desired degree of confidence. Thus, since

$$P(-2.58 < y < 2.58) = .99 \tag{8}$$

a 99 per cent confidence interval for the true mean is obtained by converting the inequalities as before and substituting $\bar{x} = 2.7$ to get

$$P_F(-1.17 < \mu < 6.57) = .99 \tag{9}$$

It is to be observed that there are, in fact, many possible intervals with the same fiducial probability. Thus, for example, since

$$P(-1.68 < y < 2.70) = .95 \tag{10}$$

another 95 per cent confidence interval for $\mu$ is given by

$$P_F(-1.35 < \mu < 5.22) = .95 \tag{11}$$

This interval is inferior to the one obtained before because its length 6.57 is greater than the length 5.88 of the interval in (7); it gives less precise information about the location of $\mu$. Any numbers $a$ and $b$ such that ordinates at those points include 95 per cent of the area under $f(y)$ will determine a 95 per cent confidence interval. Ordinarily one would want the confidence interval to be as short as possible, and it is made so by making $a$ and $b$ as close together as possible, because the relation $P(a < y < b) = .95$ gives rise to a confidence interval of length $(\sigma/\sqrt{n})(b - a)$. The distance $b - a$ will be minimized for fixed area when $f(a) = f(b)$, as is evident on referring to Fig. 45. If the point $b$ is moved a short distance to the left, the point $a$ will need to be moved a lesser distance to the left in order to keep the area the same; this operation decreases the length of the interval and will continue to do so as long as $f(b) < f(a)$. Since $f(y)$ is symmetric about $y = 0$ in the present example, the minimum value of $b - a$ for fixed area occurs when $b = -a$. Thus (7) gives the shortest 95 per cent confidence interval, and (9) gives the shortest 99 per cent confidence interval for $\mu$.

Fig. 45.
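The comparison of the intervals (7) and (11) can be checked numerically. The following is a sketch only; `interval_from_pivot` is an illustrative name of ours, and the values 2.7 and 1.5 are the sample mean and $\sigma/\sqrt{n}$ of the example.

```python
from statistics import NormalDist

unit_normal = NormalDist()

def interval_from_pivot(xbar, sigma_over_root_n, a, b):
    """Convert P(a < y < b), with y = (xbar - mu)/(sigma/sqrt(n)), into limits for mu."""
    lower = xbar - b * sigma_over_root_n
    upper = xbar - a * sigma_over_root_n
    return lower, upper

for a, b in [(-1.96, 1.96), (-1.68, 2.70)]:
    area = unit_normal.cdf(b) - unit_normal.cdf(a)   # both areas are about .95
    lower, upper = interval_from_pivot(2.7, 1.5, a, b)
    print(a, b, round(area, 3), round(lower, 2), round(upper, 2),
          round(upper - lower, 2))
```

Both pairs of ordinates enclose about the same area, but the unsymmetric pair gives the longer interval.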
The general method illustrated here is as follows: One finds, if possible, a function of the sample observations and the parameter to be estimated (the function $y$ above) which has a distribution independent of the parameter and any other parameters. Then any probability statement of the form $P(a < y < b) = \gamma$, where $y$ is the function, will give rise to a fiducial statement about the parameter. This technique is applicable in many important problems, but in many others it is not, because it is impossible to find functions of the desired form which are distributed independently of any parameters. These latter problems can be dealt with by a more general technique to be described in Sec. 5.
The idea of interval estimation can be extended to include simultaneous estimation of several parameters. Thus the two parameters of the normal distribution may be estimated by some plane region $R$ in the so-called parameter space, the space of all possible combinations of values of $\mu$ and $\sigma^2$. A 95 per cent confidence region is a region constructible from the sample such that if samples were repeatedly drawn and a region constructed for each sample, 95 per cent of those regions on the average would include the true parameter point $(\mu_0, \sigma_0^2)$.

Confidence intervals and regions provide good illustrations of uncertain inferences. In (7) the inference is made that the interval $-.24$ to 5.64 covers the true parameter value, but that statement is not made categorically. A measure, .05, of the uncertainty of the inference is an essential part of the statement.

Fig. 46.


11.2. Confidence Intervals for the Mean of a Normal Distribution. The method used in the preceding section cannot ordinarily be used to estimate the mean of a normal population, because the variance $\sigma^2$ is not ordinarily known. The function $y$ takes the form (for samples of size $n$)

$$y = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \tag{1}$$

and on converting the inequalities in, say,

$$P(-1.96 < y < 1.96) = .95 \tag{2}$$

one finds

$$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = .95 \tag{3}$$

For a given sample, $\bar{x}$ and $n$ are known, but $\sigma$ is not, so that limits for $\mu$ cannot be computed. Of course, an estimate $\hat{\sigma}$ could be substituted for $\sigma$, but then the probability statement would no longer be exact and might be very far wrong for small samples.

The way around this difficulty was shown by W. S. Gosset (who wrote under the pseudonym of "Student") in a classic paper which introduced the $t$ distribution. He is regarded as the founder of the modern theory of exact statistical inference. The quantity

$$t = \frac{\bar{x} - \mu}{\sqrt{\sum (x_i - \bar{x})^2 / n(n - 1)}} \tag{4}$$

involves only the parameter $\mu$ and has the $t$ distribution with $n - 1$ degrees of freedom which does not involve any unknown parameters. It is therefore possible to find a number, say $t_{.05}$, such that

$$P(-t_{.05} < t < t_{.05}) = \int_{-t_{.05}}^{t_{.05}} f(t; n - 1)\,dt = .95 \tag{5}$$

and then to convert the inequalities to obtain

$$P\left[\bar{x} - t_{.05}\sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n - 1)}} < \mu < \bar{x} + t_{.05}\sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n - 1)}}\right] = .95 \tag{6}$$

in which the limits can be computed for a given sample to obtain a 95 per cent confidence interval.

Fig. 47.

The number $t_{.05}$ is called the 5 per cent level of $t$ and locates points which cut off 2.5 per cent of the area under $f(t)$ on each tail. Since $f(t)$ is symmetric about $t = 0$, (6) gives the minimum 95 per cent confidence interval. Other confidence intervals can be obtained by using other levels of $t$. Thus a 99 per cent confidence interval may be found by using the number $t_{.01}$ which cuts off area .005 on each tail of the $t$ distribution.
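To make (6) concrete, the sketch below finds the level of $t$ numerically and applies it to the four observations of Sec. 11.1, now treating $\sigma$ as unknown. Reusing that sample here is our own assumption for illustration, as are the helper names; the level is obtained by Simpson's rule and bisection as a substitute for a printed table.

```python
from math import gamma, pi, sqrt

def t_density(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def central_area(t, df, steps=2000):
    """Area under the t density between -t and t (Simpson's rule, steps even)."""
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 2 * total * h / 3

def t_level(confidence, df):
    """Number c with P(-c < t < c) = confidence, located by bisection."""
    lo, hi = 0.0, 1.0
    while central_area(hi, df) < confidence:
        hi *= 2                      # widen until the root is bracketed
    for _ in range(50):
        mid = (lo + hi) / 2
        if central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(sample, confidence=0.95):
    """Limits of equation (6) for a given sample."""
    n = len(sample)
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    half_width = t_level(confidence, n - 1) * sqrt(ss / (n * (n - 1)))
    return xbar - half_width, xbar + half_width

lower, upper = t_interval([1.2, 3.4, 0.6, 5.6])
print(round(lower, 2), round(upper, 2))
```

With only three degrees of freedom the level of $t$ is well above 1.96, so the interval is noticeably wider than the one that 1.96 would give with the same estimated standard deviation.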

Figure 48 shows the result of computing 50 per cent confidence intervals for 15 samples of size four actually drawn from a normal population with zero mean and unit variance. The intervals are shown as horizontal lines above the $\mu$ axis, and, as expected, about half of them cover the true mean zero. Similarly if 95 per cent confidence intervals were used, about 95 per cent of them would be expected to cover the true mean. If one consistently uses 95 per cent confidence intervals to estimate parameters and states each time that the interval contains the true parameter value, he can expect to be wrong in 5 per cent of those statements.

Fig. 49.


11.3. Confidence Intervals for the Variance of a Normal Distribution. For samples of size $n$ from a normal population, the quantity

$$\chi^2 = \frac{\sum (x_i - \bar{x})^2}{\sigma^2} \tag{1}$$

where $\bar{x}$ is the sample mean, has the chi-square distribution with $n - 1$ degrees of freedom. Hence a confidence interval with confidence coefficient $\gamma$ may be set up by finding two numbers, say $a$ and $b$, such that

$$P(a < \chi^2 < b) = \int_a^b f(\chi^2)\,d\chi^2 = \gamma \tag{2}$$

On converting the inequalities, we obtain

$$P\left[\frac{\sum (x_i - \bar{x})^2}{b} < \sigma^2 < \frac{\sum (x_i - \bar{x})^2}{a}\right] = \gamma \tag{3}$$

which will determine a confidence interval for $\sigma^2$.

Since the length of the confidence interval is

$$\left(\frac{1}{a} - \frac{1}{b}\right) \sum (x_i - \bar{x})^2 \tag{4}$$

the shortest confidence interval for a given sample would be obtained by choosing $a$ so as to minimize $[(1/a) - (1/b)]$ for the chosen value of $\gamma$. The required computation is so tedious that it is rarely done in practice, and tables giving the required levels have not been published. The ordinary chi-square tables give numbers $\chi^2_\epsilon$ such that

$$P(\chi^2 > \chi^2_\epsilon) = \int_{\chi^2_\epsilon}^{\infty} f(\chi^2)\,d\chi^2 = \epsilon \tag{5}$$

for selected values of $\epsilon$. In setting up, say, a 95 per cent confidence interval, one merely chooses $a = \chi^2_{.975}$ and $b = \chi^2_{.025}$, i.e., selects $a$ and $b$ so that area .025 is cut off from each tail of the distribution. This very nearly minimizes the length of the confidence interval unless the number of degrees of freedom is quite small.
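The equal-tail rule just described may be sketched as follows. This is a hedged illustration, not a tabulated procedure: the chi-square levels are computed from the series for the incomplete gamma function, the function names are ours, and the four observations of Sec. 11.1 are reused purely as an assumed example.

```python
from math import exp, lgamma, log

def chi2_cdf(x, df):
    """P(chi-square <= x), by the series for the regularized incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * exp(-z + a * log(z) - lgamma(a))

def chi2_level(eps, df):
    """Number c with P(chi-square > c) = eps, the convention of equation (5)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1 - eps:
        hi *= 2                      # widen until the root is bracketed
    for _ in range(60):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def variance_interval(sample, confidence=0.95):
    """Limits of equation (3) with equal areas cut off from each tail."""
    n = len(sample)
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    tail = (1 - confidence) / 2
    a = chi2_level(1 - tail, n - 1)   # lower point, area `tail` to its left
    b = chi2_level(tail, n - 1)       # upper point, area `tail` to its right
    return ss / b, ss / a

lower, upper = variance_interval([1.2, 3.4, 0.6, 5.6])
print(round(lower, 2), round(upper, 1))
```

With only three degrees of freedom the interval is very wide and far from symmetric about the sample mean square, which illustrates why the equal-tail choice is not the shortest one for small samples.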

11.4. Confidence Region for Mean and Variance of a Normal Distribution. In constructing a region for the joint estimation of the mean $\mu_0$ and variance $\sigma_0^2$ of a normal distribution, one might at first sight be inclined to use the individual estimates given by the $t$ and the $\chi^2$ distributions. That is, for example, one might construct a .9025 ($= .95^2$) region as in Fig. 50 by using the two relations:

$$P\left[\bar{x} - t_{.05}\sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n - 1)}} < \mu_0 < \bar{x} + t_{.05}\sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n - 1)}}\right] = .95 \tag{1}$$

$$P\left[\frac{\sum (x_i - \bar{x})^2}{\chi^2_{.025}} < \sigma_0^2 < \frac{\sum (x_i - \bar{x})^2}{\chi^2_{.975}}\right] = .95 \tag{2}$$


assuming that the probability of both occurrences is the product of the separate probabilities. This is incorrect because $t$ and $\chi^2$ are not independently distributed. The joint probability that the two intervals cover the true parameter values is not equal to the product of the separate probabilities. Hence the probability that the rectangular region of Fig. 50 covers the true parameter point $(\mu_0, \sigma_0^2)$ is not .9025.

A confidence region may be set up, however, by using the distributions of $\bar{x}$ and $\sum (x_i - \bar{x})^2$, which are independently distributed. If, for example, a 95 per cent confidence region is desired, we may find numbers $a$, $a'$, and $b'$ such that

$$P\left(-a < \frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}} < a\right) = \sqrt{.95} = .975 \tag{3}$$

$$P\left[a' < \frac{\sum (x_i - \bar{x})^2}{\sigma_0^2} < b'\right] = \sqrt{.95} = .975 \tag{4}$$

The joint probability

$$P\left[-a < \frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}} < a,\; a' < \frac{\sum (x_i - \bar{x})^2}{\sigma_0^2} < b'\right] = .95 \tag{5}$$

because of the independence of the distributions. The four inequalities in (5) determine a region in the parameter space which is easily found by plotting its boundaries. One merely replaces the inequality signs by equality signs and plots each of the four resulting relations as functions of $\mu$ and $\sigma^2$ in the parameter space. A region such as the shaded area in Fig. 51 will result. A confidence region for $(\mu_0, \sigma_0)$ would be obtained in exactly the same way; the relations would be plotted as functions of $\sigma$ instead of $\sigma^2$, and the parabola in Fig. 51 would become a pair of straight lines

$$\mu = \bar{x} \pm \frac{a\sigma}{\sqrt{n}}$$

intersecting at $\bar{x}$ on the $\mu$ axis.

Fig. 50.
The region we have constructed does not have minimum area, but it is easily constructible from existing tables and will differ but little from the region of minimum area unless the sample size is small. The minimum region is roughly elliptical in shape and difficult to construct.

Fig. 51.


11.5. A General Method for Obtaining Confidence Intervals. The method used in the preceding sections for determining confidence intervals and regions required that functions of the sample and parameters be found which were distributed independently of the parameters. It is possible to set up confidence intervals, however, whether or not such functions exist.

Given a population with density $f(x; \theta)$ and an estimator $\hat{\theta}(x_1, x_2, \cdots, x_n)$ for samples of size $n$ (one would ordinarily use the maximum-likelihood estimator), we may determine the density, say $g(\hat{\theta}; \theta)$, of the estimator. We shall suppose, for definiteness, that a 95 per cent confidence interval is desired. If any arbitrary number, say $\theta'$, is substituted for $\theta$ in $g(\hat{\theta}; \theta)$, the distribution of $\hat{\theta}$ will be completely specified and it will be possible to make probability statements about $\hat{\theta}$. In particular, we may find two numbers $h_1$ and $h_2$ such that

$$P(\hat{\theta} < h_1) = \int_{-\infty}^{h_1} g(\hat{\theta}; \theta')\,d\hat{\theta} = .025 \tag{1}$$

$$P(\hat{\theta} > h_2) = \int_{h_2}^{\infty} g(\hat{\theta}; \theta')\,d\hat{\theta} = .025 \tag{2}$$

The numbers $h_1$ and $h_2$ will depend, of course, on the number substituted for $\theta$ in $g(\hat{\theta}; \theta)$. In fact, we may write $h_1$ and $h_2$ as functions of $\theta$: $h_1(\theta)$ and $h_2(\theta)$. The values of these functions for any value of $\theta$ are determined by equations (1) and (2). Obviously

$$P[h_1(\theta) < \hat{\theta} < h_2(\theta)] = \int_{h_1(\theta)}^{h_2(\theta)} g(\hat{\theta}; \theta)\,d\hat{\theta} = .95 \tag{3}$$

The functions $h_1(\theta)$ and $h_2(\theta)$ may be plotted against $\theta$ as in Fig. 52. A vertical line through any chosen value $\theta'$ of $\theta$ will intersect the two curves in points which, projected on the $\hat{\theta}$ axis, will give limits between which $\hat{\theta}$ will fall with probability .95.

Having constructed the two curves $\hat{\theta} = h_1(\theta)$ and $\hat{\theta} = h_2(\theta)$, we may construct a confidence interval for $\theta$ as follows: Draw a sample of size $n$ and compute the value of the estimator, say $\hat{\theta}'$. A horizontal line through the point $\hat{\theta}'$ on the $\hat{\theta}$ axis (Fig. 52) will intersect the two curves at points which may be projected on the $\theta$ axis and labeled $\theta_1$ and $\theta_2$ as in the figure. These two numbers define the confidence interval, for it is easily shown that

$$P_F(\theta_2 < \theta < \theta_1) = .95 \tag{4}$$

Fig. 52.

Suppose that we were in fact sampling from a population that had $\theta'$ as the value of $\theta$. The probability that the estimate $\hat{\theta}$ will fall between $h_1(\theta')$ and $h_2(\theta')$ is .95. If the estimate does fall between these limits, then the horizontal line will cut the vertical line through $\theta'$ at some point between the curves and the corresponding interval $(\theta_2, \theta_1)$ will cover $\theta'$. If the estimate does not fall between $h_1(\theta')$ and $h_2(\theta')$, the horizontal line does not cut the vertical line between the curves and the corresponding interval $(\theta_2, \theta_1)$ does not cover $\theta'$. It follows, therefore, that the probability is exactly .95 that an interval $(\theta_2, \theta_1)$ constructed by this method will cover $\theta'$. And this statement is true for any population value of $\theta$.

It is sometimes possible to determine the limits $\theta_2$ and $\theta_1$ for a given estimate without actually finding the functions $h_1(\theta)$ and $h_2(\theta)$. Referring to Fig. 52, the limits for $\theta$ are at points $\theta_2$ and $\theta_1$ such that $h_1(\theta_1) = \hat{\theta}'$ and $h_2(\theta_2) = \hat{\theta}'$. In terms of the definition of $h_1$ and $h_2$, we may say that $\theta_1$ is the value of $\theta$ for which

$$\int_{-\infty}^{\hat{\theta}'} g(\hat{\theta}; \theta)\,d\hat{\theta} = .025 \tag{5}$$

and $\theta_2$ is the value of $\theta$ for which

$$\int_{\hat{\theta}'}^{\infty} g(\hat{\theta}; \theta)\,d\hat{\theta} = .025 \tag{6}$$

If the left-hand sides of these two equations can be given explicit expressions in terms of $\theta$, and if the equations can be solved for $\theta$ uniquely, then those roots are the 95 per cent confidence limits for $\theta$.

If $h_1(\theta)$ and $h_2(\theta)$ are not monotonic functions of $\theta$, the confidence interval may in fact be a set of intervals. Thus suppose the curves of Fig. 52 bent down farther to the right so that the horizontal line at $\hat{\theta}'$ cut them again, for example, at points $\theta_3$ and $\theta_4$. Then the confidence interval would actually consist of two intervals $(\theta_2, \theta_1)$ and $(\theta_3, \theta_4)$. The fiducial statement about $\theta$ would then be of the form

$$P_F(\theta_2 < \theta < \theta_1, \text{ or } \theta_3 < \theta < \theta_4) = .95 \tag{7}$$

However, in most situations encountered in practice there will be a single interval, or it will be possible to select a single interval on the basis of other evidence concerning the experiment which produced the sample observations.

The method described here for obtaining confidence intervals may be extended to the case of several parameters, but a geometrical representation becomes impossible even for two parameters. Suppose a distribution depends on two parameters $\theta_1$ and $\theta_2$; we may find a plane region $R$ in the $\hat{\theta}_1, \hat{\theta}_2$ plane such that

$$P(\hat{\theta}_1, \hat{\theta}_2 \text{ in } R) = \iint_R g(\hat{\theta}_1, \hat{\theta}_2; \theta_1, \theta_2)\,d\hat{\theta}_1\,d\hat{\theta}_2 = .95 \tag{8}$$

By considering all possible pairs of values of $\theta_1$ and $\theta_2$, we can generate a four-dimensional region in the $\theta_1, \theta_2, \hat{\theta}_1, \hat{\theta}_2$ space which is analogous to the two-dimensional region between the curves in Fig. 52. Now suppose a sample is drawn and the estimates $\hat{\theta}_1'$ and $\hat{\theta}_2'$ calculated. The intersection of the two hyperplanes $\hat{\theta}_1 = \hat{\theta}_1'$ and $\hat{\theta}_2 = \hat{\theta}_2'$ with the four-dimensional region will determine a two-dimensional region, which, when projected on the $\theta_1, \theta_2$ plane, will be a 95 per cent confidence region for $\theta_1, \theta_2$.

The argument may be extended to cover the case of $k$ parameters. The method will determine a confidence region for all the parameters of a distribution. If one wishes to estimate some but not all of a set of parameters, the method cannot be used in general, though it may be modified to handle the problem in special circumstances. There is as yet no general solution to the problem of setting up confidence regions for a part of a set of $k$ parameters in a distribution function except in the case of large samples.

Illustrative example: As a simple illustration, we may consider the estimation of $\alpha$ in

$$f(x; \alpha) = \frac{2}{\alpha^2}(\alpha - x) \qquad 0 < x < \alpha \tag{9}$$

for samples of size one. If $x$ is the observation, the maximum-likelihood estimator is found to be $\hat{\alpha} = 2x$ by solving

$$\frac{\partial}{\partial \alpha}\left[\frac{2}{\alpha^2}(\alpha - x)\right] = 0$$

for $\alpha$. The distribution of the estimator is

$$g(\hat{\alpha}; \alpha) = \frac{1}{2\alpha^2}(2\alpha - \hat{\alpha}) \qquad 0 < \hat{\alpha} < 2\alpha \tag{10}$$

so that 95 per cent confidence intervals are obtained by determining $h_1(\alpha)$ and $h_2(\alpha)$ so that

$$\int_0^{h_1(\alpha)} g(\hat{\alpha}; \alpha)\,d\hat{\alpha} = .025 \tag{11}$$

$$\int_{h_2(\alpha)}^{2\alpha} g(\hat{\alpha}; \alpha)\,d\hat{\alpha} = .025 \tag{12}$$

The integrations are easily performed in this case and give, on solving for $h_1$ and $h_2$,

$$h_1(\alpha) = 2(1 - \sqrt{.975})\alpha \tag{13}$$

$$h_2(\alpha) = 2(1 - \sqrt{.025})\alpha \tag{14}$$

These plot as straight lines, as in Fig. 53. For a given observation, say $x = 2$, the estimate is $\hat{\alpha}' = 4$ and the 95 per cent confidence interval is given by

$$P_F\left[\frac{2}{1 - \sqrt{.025}} < \alpha < \frac{2}{1 - \sqrt{.975}}\right] = .95 \tag{15}$$

Actually, since

$$u = \frac{2\alpha - \hat{\alpha}}{\alpha}$$

is distributed independently of $\alpha$, it was not necessary to use the general method in this problem. We could have found a confidence interval for $\alpha$ by getting .95 limits for $u$ and then converting the inequalities to get a statement about $\alpha$.
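The limits in (15) are easily evaluated. The sketch below, with function names of our own choosing, inverts the straight lines (13) and (14) at the observed estimate and then checks the result against the tail areas of the density (10).

```python
from math import sqrt

def h1(alpha):
    """Lower curve, equation (13)."""
    return 2 * (1 - sqrt(0.975)) * alpha

def h2(alpha):
    """Upper curve, equation (14)."""
    return 2 * (1 - sqrt(0.025)) * alpha

def estimator_cdf(a_hat, alpha):
    """P(estimator <= a_hat) obtained by integrating the density (10) from 0 to a_hat."""
    u = a_hat / (2 * alpha)
    return 1 - (1 - u) ** 2

def alpha_interval(x):
    """95 per cent confidence limits for alpha from a single observation x."""
    alpha_hat = 2 * x
    lower = alpha_hat / (2 * (1 - sqrt(0.025)))   # solves h2(alpha) = alpha_hat
    upper = alpha_hat / (2 * (1 - sqrt(0.975)))   # solves h1(alpha) = alpha_hat
    return lower, upper

lower, upper = alpha_interval(2.0)
print(round(lower, 3), round(upper, 1))
# Each limit leaves area .025 in the appropriate tail of the estimator's distribution
print(round(estimator_cdf(4.0, upper), 3), round(1 - estimator_cdf(4.0, lower), 3))
```

The upper limit is very large because a single observation near zero is quite probable even when $\alpha$ is large, so one observation can say little about how large $\alpha$ may be.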


Fig. 53.


11.6. Confidence Intervals for the Parameter of a Binomial Distribution. We shall apply the general method described in the preceding section to a problem which requires its use. If a sample, $x_1, x_2, \cdots, x_n$, is drawn from a binomial population with

$$f(x; p) = p^x(1 - p)^{1 - x} \qquad x = 0, 1 \tag{1}$$

the maximum-likelihood estimator of $p$ is

$$\hat{p} = \frac{y}{n} \tag{2}$$

where $y = \sum x_i$ can have the values $0, 1, 2, \cdots, n$. The density of $\hat{p}$ is

$$g(\hat{p}; p) = \binom{n}{n\hat{p}} p^{n\hat{p}}(1 - p)^{n(1 - \hat{p})} \qquad \hat{p} = 0, \frac{1}{n}, \frac{2}{n}, \cdots, 1 \tag{3}$$

and it is not possible to find a function of $\hat{p}$ and $p$ which is distributed independently of $p$.


Again we shall suppose for definiteness that a 95 per cent confidence interval is to be constructed. The first step is to determine the functions $h_1(p)$ and $h_2(p)$. For $p = .4$, for example, we would, in accordance with the preceding section, seek a number $h_1(.4)$ such that

$$P[\hat{p} < h_1(.4)] = \sum_{y=0}^{nh_1} \binom{n}{y}(.4)^y(.6)^{n-y} = .025 \tag{4}$$

However, in view of the discreteness of the distribution, $nh_1$ in the sum must be an integer, and it will be impossible to make the sum exactly .025 for every value of $p$. This need not worry us though. We do not need a curve $h_1(p)$ defined at every $p$. The only points of interest are those which correspond to the possible values of $\hat{p}$. It is, in fact, possible to use the technique indicated by equations (5.5) and (5.6) of the preceding section, because an explicit expression for the probabilities on the left of these equations is immediately at hand. Assuming we have an estimate

$$\hat{p}' = \frac{k}{n} \tag{5}$$

the 95 per cent confidence upper limit $p_1$ may be determined by finding the value of $p$ for which

$$\sum_{y=0}^{k} \binom{n}{y} p^y(1 - p)^{n-y} = .025 \tag{6}$$

and the lower limit $p_2$ is the value of $p$ for which

$$\sum_{y=k}^{n} \binom{n}{y} p^y(1 - p)^{n-y} = .025 \tag{7}$$

If $k$ is zero, the lower limit is taken to be zero, and if $k = n$, the upper limit is taken to be one.

For small values of $n$, equations (6) and (7) may be solved by trial and error for the roots $p_1$ and $p_2$, but this computation rapidly becomes tedious with increasing $n$. A simple method of solution is provided by Pearson's tables of the incomplete beta function. The cumulative form of the beta distribution is

$$F(x; \alpha, \beta) = \frac{(\alpha + \beta + 1)!}{\alpha!\,\beta!} \int_0^x t^{\alpha}(1 - t)^{\beta}\,dt \tag{8}$$

and repeated integration by parts gives

$$F(x; \alpha, \beta) = 1 - \sum_{i=0}^{\alpha} \binom{\alpha + \beta + 1}{i} x^i(1 - x)^{\alpha + \beta + 1 - i} \tag{9}$$

It follows that partial binomial sums are given by the table of $F(x; \alpha, \beta)$. We may write equation (6) as

$$\sum_{y=0}^{k} \binom{n}{y} p^y(1 - p)^{n-y} = 1 - F(p; k, n - k - 1) = .025 \tag{10}$$

and find at once in the table the value of $p$ which corresponds to $F = .975$ for the given values of $k$ and $n - k - 1$. Similarly, since

$$\sum_{y=k}^{n} \binom{n}{y} p^y(1 - p)^{n-y} = 1 - \sum_{y=0}^{k-1} \binom{n}{y} p^y(1 - p)^{n-y}$$

we may find the lower confidence limit by putting (7) in the form

$$\sum_{y=k}^{n} \binom{n}{y} p^y(1 - p)^{n-y} = F(p; k - 1, n - k) = .025 \tag{11}$$

For values of $n$ beyond the range of the table, the normal approximation to the binomial distribution may be used to obtain confidence intervals for $p$, as is shown in the following section.
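Where no table of the incomplete beta function is at hand, equations (6) and (7) can be solved directly by machine. The following sketch uses bisection as a stand-in for the tables; it is an assumption of ours for illustration, and the names `binomial_sum`, `solve_for_p`, and `binomial_limits` are hypothetical.

```python
from math import comb

def binomial_sum(n, p, first, last):
    """Sum of binomial probabilities for y = first, ..., last."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(first, last + 1))

def solve_for_p(func, target, increasing):
    """Root of func(p) = target on (0, 1) by bisection, for monotonic func."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid          # root lies to the right of mid
        else:
            hi = mid          # root lies to the left of mid
    return (lo + hi) / 2

def binomial_limits(k, n, tail=0.025):
    """Lower and upper confidence limits for p given k successes in n trials."""
    # Equation (6): the sum over y = 0..k decreases as p increases
    upper = 1.0 if k == n else solve_for_p(
        lambda p: binomial_sum(n, p, 0, k), tail, increasing=False)
    # Equation (7): the sum over y = k..n increases as p increases
    lower = 0.0 if k == 0 else solve_for_p(
        lambda p: binomial_sum(n, p, k, n), tail, increasing=True)
    return lower, upper

print(binomial_limits(3, 20))
```

For $k = 0$ equation (6) reduces to $(1 - p)^n = .025$, which gives a convenient check on the routine.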

11.7. Confidence Intervals for Large Samples. We have seen in Chap. 10 that for large samples, the maximum-likelihood estimator $\hat{\theta}$ for a parameter $\theta$ in a density $f(x; \theta)$ is approximately normally distributed about $\theta$ under rather general conditions. When these conditions are satisfied, it is possible to obtain approximate confidence intervals quite easily. The large-sample variance of the estimator is, say,

$$\sigma^2(\theta) = \frac{-1}{nE[\partial^2 \log f(x; \theta)/\partial\theta^2]} \tag{1}$$

and we have indicated that it is a function of $\theta$ since it ordinarily will depend on $\theta$. For large samples, therefore, a confidence interval with fiducial probability $\gamma$ may be determined by converting the inequalities in

$$P\left[-d_\gamma < \frac{\hat{\theta} - \theta}{\sigma(\theta)} < d_\gamma\right] \cong \gamma \tag{2}$$

where $d_\gamma$ is chosen so that

$$\frac{1}{\sqrt{2\pi}} \int_{-d_\gamma}^{d_\gamma} e^{-t^2/2}\,dt = \gamma$$

As an example, we may consider the binomial distribution with parameter $p$; the variance of $\hat{p}$ is

$$\sigma^2(p) = \frac{p(1 - p)}{n} \tag{3}$$

An approximate $\gamma$ confidence interval, for example, is obtained by converting the inequalities in

$$P\left[-d_\gamma < \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} < d_\gamma\right] \cong \gamma \tag{4}$$

to get

$$P\left[\frac{2n\hat{p} + d_\gamma^2 - d_\gamma\sqrt{4n\hat{p} + d_\gamma^2 - 4n\hat{p}^2}}{2(n + d_\gamma^2)} < p < \frac{2n\hat{p} + d_\gamma^2 + d_\gamma\sqrt{4n\hat{p} + d_\gamma^2 - 4n\hat{p}^2}}{2(n + d_\gamma^2)}\right] \cong \gamma \tag{5}$$

These expressions for the limits may be simplified if we recall that in deriving the large-sample distribution, we neglect certain terms containing the factor $1/\sqrt{n}$; i.e., the asymptotic normal distribution is correct only to within error terms of size $k/\sqrt{n}$. We may therefore neglect terms of this order in the limits in (5) without affecting the accuracy of the approximation. This means simply that we may omit all the $d_\gamma^2$ in (5), because they always occur added to a term with factor $n$ and will be negligible, relative to $n$ when $n$ is large, to within the degree of approximation we are assuming. Thus (5) may be rewritten as

$$P\left[\hat{p} - d_\gamma\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p < \hat{p} + d_\gamma\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right] \cong \gamma \tag{6}$$

In particular,

$$P\left[\hat{p} - 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p < \hat{p} + 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right] \cong .95$$

gives an approximate 95 per cent confidence interval for $p$ for large samples.
We may observe that (6) is just the expression that would have been obtained had $\hat{p}$ been substituted for $p$ in $\sigma^2(p)$. This substitution would imply that

$$\frac{\hat{p} - p}{\sqrt{\hat{p}(1 - \hat{p})/n}}$$

is approximately normally distributed with zero mean and unit variance. It is, in fact, true in general that in the asymptotic normal distribution of a maximum-likelihood estimator $\hat{\theta}$, the variance $\sigma^2(\theta)$ may be replaced by its estimator $\sigma^2(\hat{\theta})$ without appreciably affecting the accuracy of the approximation. We shall not prove this fact but shall use it because it greatly simplifies the conversion of inequalities in a probability statement to get confidence intervals.

For large samples, therefore, an approximate confidence interval with confidence coefficient $\gamma$ is given by

$$P(\hat{\theta} - d_\gamma\sigma(\hat{\theta}) < \theta < \hat{\theta} + d_\gamma\sigma(\hat{\theta})) = \gamma \tag{7}$$

when $\hat{\theta}$ is asymptotically normally distributed, and $\sigma(\hat{\theta})$ in this expression is the maximum-likelihood estimate of the standard deviation of $\hat{\theta}$.
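The two forms (5) and (6) of the large-sample binomial interval can be compared numerically. The sketch below is illustrative only; the function names and the trial values $\hat{p} = .4$, $n = 100$ are assumptions of ours, and $d_\gamma$ is taken from the standard library's normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def d_gamma(gamma):
    """Level d with area gamma between -d and d under the unit normal density."""
    return NormalDist().inv_cdf(0.5 + gamma / 2)

def interval_eq5(p_hat, n, gamma=0.95):
    """Limits of equation (5), obtained by solving the quadratic in p exactly."""
    d = d_gamma(gamma)
    centre = 2 * n * p_hat + d * d
    root = d * sqrt(4 * n * p_hat + d * d - 4 * n * p_hat**2)
    denom = 2 * (n + d * d)
    return (centre - root) / denom, (centre + root) / denom

def interval_eq6(p_hat, n, gamma=0.95):
    """Limits of equation (6), with the terms in d squared neglected."""
    d = d_gamma(gamma)
    half_width = d * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

print(interval_eq5(0.4, 100))
print(interval_eq6(0.4, 100))
```

For moderately large $n$ the two pairs of limits differ only slightly, which is the sense in which the simplification leading to (6) costs nothing at the assumed order of approximation.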

11.8. Confidence Regions for Large Samples. When a distribution involves several parameters $(\theta_1, \theta_2, \cdots, \theta_k)$, we have seen in Chap. 10 that under rather general conditions the large-sample maximum-likelihood estimates, $(\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_k)$, are approximately normally distributed with means $(\theta_1, \theta_2, \cdots, \theta_k)$ and coefficients of the quadratic form given by

$$\|\sigma^{ij}(\theta_1, \cdots, \theta_k)\| = \left\| -nE\left[\frac{\partial^2 \log f(x; \theta_1, \theta_2, \cdots, \theta_k)}{\partial\theta_i\,\partial\theta_j}\right] \right\| \tag{1}$$

The coefficients will, in general, be functions of the $\theta_i$ as we have indicated.

Now we have seen that the quadratic form of a $k$-variate normal distribution has the chi-square distribution with $k$ degrees of freedom. We may conclude, therefore, that the quantity

$$u = \sum_{i=1}^{k}\sum_{j=1}^{k} \sigma^{ij}(\theta_1, \cdots, \theta_k)(\hat{\theta}_i - \theta_i)(\hat{\theta}_j - \theta_j) \tag{2}$$

is approximately distributed by the chi-square distribution with $k$ degrees of freedom for large samples. Here again, the accuracy of the approximation is not impaired by substituting the estimates of the $\theta_i$ for the $\theta_i$ in $\sigma^{ij}(\theta_1, \cdots, \theta_k)$; the quantity

$$v = \sum\sum \sigma^{ij}(\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_k)(\hat{\theta}_i - \theta_i)(\hat{\theta}_j - \theta_j) \tag{3}$$

is also approximately distributed by the chi-square law with $k$ degrees of freedom.

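The variate $v$ enables us to set up a very simple confidence region for the $\theta_i$. If $\chi^2_{1-\gamma}$ is the $1 - \gamma$ level of the chi-square distribution, then

$$P(v < \chi^2_{1-\gamma}) = \gamma \tag{4}$$

determines a confidence region in the parameter space. The boundary of the region is given by the equation

$$\sum\sum \sigma^{ij}(\hat{\theta}_1, \cdots, \hat{\theta}_k)(\hat{\theta}_i - \theta_i)(\hat{\theta}_j - \theta_j) = \chi^2_{1-\gamma} \tag{5}$$

which is the equation of an ellipsoid in the $(\theta_1, \theta_2, \cdots, \theta_k)$ space with its center at $(\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_k)$.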

If one is interested in estimating only a part of a set of $k$ parameters, for example, the set $(\theta_1, \theta_2, \cdots, \theta_r)$ where $r < k$, we first find the marginal distribution of the maximum-likelihood estimators for this set. If we let $(a, b)$ be indices which have the range $1, 2, \cdots, r$, then the coefficients $\bar{\sigma}^{ab}$ of the quadratic form of the large-sample normal distribution of $\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_r$ are given by

$$\|\bar{\sigma}^{ab}\| = \|\sigma_{ab}\|^{-1}$$

where the matrix $\|\sigma_{ab}\|$ is obtained by striking out the last $k - r$ rows and columns in $\|\sigma_{ij}\|$. The $\bar{\sigma}^{ab}$ will, in general, be functions of all $k$ of the original parameters $\theta_1, \theta_2, \cdots, \theta_k$. If we substitute the $\hat{\theta}_i$ for the $\theta_i$ in $\bar{\sigma}^{ab}$, we shall obtain the maximum-likelihood estimators $\hat{\bar{\sigma}}^{ab}$ of the $\bar{\sigma}^{ab}$. The quadratic form

$$w = \sum_{a=1}^{r}\sum_{b=1}^{r} \hat{\bar{\sigma}}^{ab}(\hat{\theta}_a - \theta_a)(\hat{\theta}_b - \theta_b)$$

is approximately distributed like chi square with $r$ degrees of freedom and will serve to determine an ellipsoidal confidence region in the $\theta_1, \theta_2, \cdots, \theta_r$ space for those parameters.

As an example of the estimation of more than one parameter, we may consider the large-sample estimation of the mean and variance of a normal population. We have seen in Sec. 10.9 that $\bar{x}$ and $\hat{\sigma}^2$ are approximately distributed with means $\mu$ and $\sigma^2$ and with coefficients of the quadratic form

$$\|\sigma^{ij}(\mu, \sigma^2)\| = \begin{Vmatrix} n/\sigma^2 & 0 \\ 0 & n/2\sigma^4 \end{Vmatrix} \tag{6}$$

If we substitute $\hat{\sigma}^2$ for $\sigma^2$ in (6), then the quadratic form becomes

$$v = \frac{n}{\hat{\sigma}^2}(\bar{x} - \mu)^2 + \frac{n}{2\hat{\sigma}^4}(\hat{\sigma}^2 - \sigma^2)^2 \tag{7}$$

which is approximately distributed like chi square with two degrees of freedom for large samples. In particular, let us suppose that we have an actual sample of 100 observations $(3.4, 5.1, \cdots, 2.2)$ with

$$\bar{x} = \tfrac{1}{100}\sum x_i = 4$$

$$\hat{\sigma}^2 = \tfrac{1}{100}\sum (x_i - \bar{x})^2 = 5$$

Since the .05 level of chi square with two degrees of freedom is 5.99, a 95 per cent confidence region for $\mu$ and $\sigma^2$ is determined by

$$P[20(4 - \mu)^2 + 2(5 - \sigma^2)^2 < 5.99] = .95 \tag{8}$$

The values of $\mu$ and $\sigma^2$ which satisfy the inequality in (8) are the points within the ellipse

$$20(4 - \mu)^2 + 2(5 - \sigma^2)^2 = 5.99$$

which is plotted in Fig. 54. This is the 95 per cent confidence region for the true parameter point, say $(\mu_0, \sigma_0^2)$. Before the sample was drawn, the probability was about .95 that the region we were going to construct would cover the true parameter point.

Fig. 54.
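Whether a given parameter point lies inside the ellipse of (8) is a one-line computation. The sketch below uses illustrative function names of ours, and relies on the fact that for two degrees of freedom $P(\chi^2 > c) = e^{-c/2}$, so that the level can be written in closed form.

```python
from math import log

def chi2_level_2df(gamma):
    """Level c with P(chi-square < c) = gamma for two degrees of freedom."""
    # For 2 degrees of freedom P(chi-square > c) = exp(-c/2)
    return -2 * log(1 - gamma)

def quadratic_form(mu, var, xbar, var_hat, n):
    """The variate v of equation (7)."""
    return (n / var_hat) * (xbar - mu) ** 2 + (n / (2 * var_hat**2)) * (var_hat - var) ** 2

def in_region(mu, var, xbar=4.0, var_hat=5.0, n=100, gamma=0.95):
    """True if (mu, var) lies inside the large-sample confidence ellipse."""
    return quadratic_form(mu, var, xbar, var_hat, n) < chi2_level_2df(gamma)

print(round(chi2_level_2df(0.95), 2))
print(in_region(4.5, 5.0), in_region(4.0, 7.0))
```

With the sample values of the text the coefficients reduce to 20 and 2, so the point $(4.5, 5)$ gives $v = 5$ and lies inside the region, while $(4, 7)$ gives $v = 8$ and lies outside.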

The large-sample confidence intervals and regions presented in this and the preceding section have an optimum property which we shall point out but not prove. In the earlier sections of the chapter, we were concerned with finding the shortest interval for a given fiducial probability. Thus the shortest 95 per cent interval for the mean of a normal population when $\sigma$ is known is given by

$$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = .95$$

and the length of the interval is $2 \times 1.96\sigma/\sqrt{n}$, where $n$ is the sample size. Now let us suppose that, instead of using $\bar{x} = (1/n)\sum x_i$ to construct the confidence interval, we used only one of the observations, say the first. The estimator is simply

$$\hat{\mu} = x_1$$

and the confidence interval is given by

$$P(\hat{\mu} - 1.96\sigma < \mu < \hat{\mu} + 1.96\sigma) = .95$$

which has length $2 \times 1.96\sigma$. This interval is $\sqrt{n}$ times as long as the one obtained by using the sample mean as the estimator.

It is now evident that the length of a confidence interval for a param- 
eter depends strongly on what function of the sample observations is 
chosen as an estimator. The optimum property of the large-sample 
intervals and regions based on maximum-likelihood estimators is this: 

Large-sample confidence intervals and regions based on maximum- 

likelihood estimators will be smaller on the average than intervals and 
regions determined by any other estimators of the parameters. 
This property of maximum-likelihood estimators is closely related to 
the fact that they are efficient, i.e., that they have smaller variance in 
large samples than other estimators. By “other estimators” we mean 
functionally different estimators; one would obtain essentially the 
same confidence regions by using estimators which were functions of 
the maximum-likelihood estimators. The phrase “on the average”
refers to the fact that confidence regions usually vary in size from 
sample to sample (see Fig. 48), and for a given sample a region deter- 
mined by some other estimators may be smaller than the region 
determined by the maximum-likelihood estimators. But for repeated 
sampling, the average size of the regions determined by maximum- 
likelihood estimators will be smaller than the average size of regions 
determined by other estimators. 


11.9. Problems 


1. Find a 90 per cent confidence interval for the mean of a normal 
distribution with σ = 3, given the sample (2.3, −.2, −.4, −.9).
What would be the confidence interval if σ were unknown?


2. The breaking strengths in pounds of five specimens of manilla 
rope of diameter 245 inch were found to be 560, 480, 540, 570, 540. 
Estimate the mean breaking strength by a 95 per cent confidence 
interval, assuming normality. Estimate the point at which only 5 per 
cent of such specimens would be expected to break. 

3. Referring to Prob. 2, estimate σ² by a 90 per cent confidence
interval; also σ.

4. Referring to Prob. 2, plot an 81 per cent confidence region for
the joint estimation of μ and σ²; for μ and σ.

5. Five samples were drawn from populations assumed to be 
normal and assumed to have the same variance. The values of 
s² = Σ(xᵢ − x̄)² and n, the sample size, were

    s²:  40  22  17  42  45
    n:    6   4   3   7   8

Find 98 per cent confidence limits for the common variance. 
6. The largest observation x′ of a sample of n from a rectangular
density f(x) = 1/θ (0 < x < θ) has the density

    $$ f(x') = \frac{n\,(x')^{n-1}}{\theta^n}, \qquad 0 < x' < \theta $$

Show that u = x′/θ is distributed independently of θ. Using u, find
the shortest confidence interval for θ for fiducial probability γ.

7. Compute a 95 per cent confidence interval for the range of a 
rectangular distribution given the sample (2.6, 1.2, 4.3, 1.6), and given 
that the lower limit of the range is zero. 

8. To test two promising new lines of hybrid corn under normal 
farming conditions, a seed company selected eight farms at random in 
Iowa and planted both lines in experimental plots on each farm. The
yields (converted to bushels per acre) for the eight locations were:

    Line A:  86  87  56  93  84  93  75  79
    Line B:  80  79  58  91  77  82  74  66

Assuming the two yields are jointly normally distributed, estimate the 
difference between the mean yields by a 95 per cent confidence interval. 
(Refer to Prob. 22 of Chap. 10.) 
9. Using the density

    $$ f(x) = \frac{4x^3}{\theta^4}, \qquad 0 < x < \theta $$

for the largest of four observations from a rectangular population, set
up a general system of 95 per cent confidence intervals for θ by finding
the functions h₁(θ) and h₂(θ) and plotting these in the (θ̂, θ) plane.
Find the interval for the sample given in Prob. 7. Why does it differ 
from the interval found in that problem? 

10. Referring to Prob. 9, plot the functions h₁(θ) and h₂(θ) for
samples of size eight. Then show in general that the lengths of the 
intervals decrease as the sample size n increases. 

11. The sample (2.3, 1.2, 0.9, 3.2) was drawn from a population 
distributed by f(x) = αe^{−αx}, x ≥ 0. Find a 90 per cent confidence
interval for α.

12. Referring to Prob. 11, find 90 per cent confidence intervals for
the mean and for the variance of the distribution. What is the
fiducial probability that both these intervals cover the true mean and
true variance, respectively?

13. One head and two tails resulted when a coin was tossed three 
times. Find a 90 per cent confidence interval for the probability of a 
head. 

14. 160 heads and 240 tails resulted from 400 tosses of a coin. Find 
a 90 per cent confidence interval for the probability of a head. Find 
a 99 per cent confidence interval. Does this appear to be a true coin? 

15. A sample of 2000 voters were asked their attitude toward a 
certain political proposal. 1200 favored the proposal; 600 opposed 
it; and 200 were undecided. Assuming this was a random sample 
from a trinomial population, construct a 95 per cent confidence region 
for p₁ and p₂, the proportions of individuals for and against the
proposal. (Use the results of Sec. 10.9.)

16. Plot a 95 per cent confidence region like that of Fig. 51 for the 
example used in Sec. 8 and compare it with the region of Fig. 54. 

17. Integrate by parts [integrating (1 − t)^β and differentiating t^α]
to show

    $$ \int_0^x t^{\alpha}(1-t)^{\beta}\,dt = -\frac{1}{\beta+1}\,x^{\alpha}(1-x)^{\beta+1} + \frac{\alpha}{\beta+1}\int_0^x t^{\alpha-1}(1-t)^{\beta+1}\,dt $$

18. Apply the above result repeatedly to obtain a cumulative form
for the beta distribution, F(x; α, β).

19. Show that

    $$ F(x; \alpha, \beta) = \sum_{i=\alpha+1}^{\alpha+\beta+1} \binom{\alpha+\beta+1}{i}\, x^i (1-x)^{\alpha+\beta+1-i} $$

by using the result of Prob. 18. This is the form that would have
arisen had the integration by parts been done the other way,
differentiating (1 − t)^β and integrating t^α.


20. Given a sample of size 100 from a normal population with
μ̂ = 3, σ̂² = .25, what is the maximum-likelihood estimate of the
number α for which

    $$ \int_{\alpha}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2\sigma^2)(x-\mu)^2}\,dx = .05 $$

21. Find the large-sample distribution of μ̂ and σ̂ for samples from a
normal population. Since it is known that μ̂ and σ̂ will be normally
and independently distributed with means μ and σ, it is only necessary
to find their variance.

22. Referring to the above problem, find the large-sample distribution
of μ̂ + kσ̂ where k is a given constant. Use this to obtain a 95 per
cent confidence interval for α in Prob. 20.

23. Develop a method for estimating the ratio of the variances of
two normal populations by a confidence interval.

24. Develop a method for estimating the parameter of the Poisson
distribution by a confidence interval. (Refer to Prob. 33 of Chap. 6.)

25. Work through the details of the derivation of equation (2.6).

26. What is the probability that the length of a t confidence interval
will be less than σ for samples of size 20?

27. Compare the average length of a 95 per cent confidence interval 
for the mean of a normal population based on the t distribution with
the length that the interval would have were the variance known.

28. Show that the length and the variance of the length of the t
confidence interval approach zero with increasing sample size. 

29. How large a sample must be drawn from a normal population to 
make the probability .95 that a 90 per cent confidence interval (based 
on t) for the mean will have length less than σ/5?

30. Show that the length of the confidence interval for σ (of a normal
population) approaches zero with increasing sample size. 

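31. Consider a truncated normal population with density

    $$ f(x) = \frac{1}{\alpha}\,\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2)(x-\mu)^2/\sigma^2}, \qquad x < a $$

where

    $$ \alpha = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2)(x-\mu)^2/\sigma^2}\,dx $$

Show that ∂ log f(x)/∂μ and ∂ log f(x)/∂σ have zero expectations.

32. Referring to Prob. 31, let μ̂ and σ̂ be maximum-likelihood
estimators of μ and σ. Show that the matrix of coefficients of the
quadratic form for the large-sample distribution of μ̂ and σ̂ is

    $$ \|\sigma^{ij}\| = \begin{pmatrix} \dfrac{n(1 - tb - b^2)}{\sigma^2} & \dfrac{-nb(1 + tb + t^2)}{\sigma^2} \\[2ex] \dfrac{-nb(1 + tb + t^2)}{\sigma^2} & \dfrac{n(2 - tb - t^2b^2 - t^3b)}{\sigma^2} \end{pmatrix} $$

where b = σf(a), and where t = (a − μ)/σ.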


CHAPTER 12 
TESTS OF HYPOTHESES 


12.1. Introduction. There are two major areas of statistical infer- 
ence—the estimation of parameters and the testing of hypotheses. We 
shall study the second of these two areas in this chapter. Our aim 
will be to develop general methods for testing hypotheses and to apply
those methods to some common problems. The methods will be of 
further use in later chapters. 

In experimental research, the object is sometimes merely to estimate
a parameter. Thus one may wish to estimate the yield of a new hybrid
line of corn. But more often the ultimate purpose will involve some 
use of the estimate. One may wish, for example, to compare the 
yield of the new line with that of a standard line and perhaps recom- 
mend that the new line replace the standard line if it appears superior. 
This is a common situation in research. One may wish to determine 
whether a new method of sealing light bulbs will increase the life of the 
bulbs, whether a new germicide is more effective in treating a certain 
infection than a standard germicide, whether one method of preserving 
foods is better than another in so far as retention of vitamins is con- 
cerned, and so on. 

Using the light-bulb example as an illustration, let us suppose that 
the average life of bulbs made under a standard manufacturing pro- 
cedure is 1400 hours. It is desired to test a new procedure for manu- 
facturing the bulbs. The statistical model here is this: We are 
dealing with two populations of light bulbs—those made by the 
standard process and those made by the proposed process. We know 
(from numerous past investigations) that the mean of the first popula- 
tion is about 1400. The question is whether the mean of the second 
population is greater than or less than 1400. To answer this question, 
we set up a null hypothesis, namely, the hypothesis that the two means 
are the same. On the basis of a sample from the second population 
we shall either accept or reject the null hypothesis. (Naturally we hope 
that the new process is better and that the null hypothesis will be 
rejected.) The reason for this roundabout way of doing things will
become apparent later.

To test the null hypothesis, a number of bulbs are made by the new 
process and their lives measured. Suppose the mean of this sample 
of observations is 1550 hours. The indication is that the new process 
is better, but suppose the estimate of the standard deviation of the
mean σ/√n is 125 (n being the sample size). Then a 95 per cent
confidence interval for the mean of the second population (assuming
normality) is roughly 1300 to 1800 hours. The sample mean 1550
could very easily have come from a population with mean 1400. We
have no strong grounds for rejecting the null hypothesis. If, on the
other hand, σ/√n were 25, then we could very confidently reject the
null hypothesis and pronounce the proposed manufacturing process 
to be superior. 
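
The arithmetic of the light-bulb illustration may be sketched as follows. The multiplier 1.96 is an assumption here (the text says only "roughly"), and the function name `approx_interval` is illustrative.

```python
def approx_interval(mean, std_error, z=1.96):
    """Approximate 95 per cent confidence interval: mean +/- z standard errors."""
    return (mean - z * std_error, mean + z * std_error)

standard_mean = 1400     # mean life under the standard process
sample_mean = 1550       # observed mean life under the new process

wide = approx_interval(sample_mean, 125)     # standard error 125
narrow = approx_interval(sample_mean, 25)    # standard error 25

# The null hypothesis is in doubt only when 1400 falls outside the interval.
wide_covers = wide[0] <= standard_mean <= wide[1]
narrow_covers = narrow[0] <= standard_mean <= narrow[1]

print(wide, wide_covers)
print(narrow, narrow_covers)
```

With a standard error of 125 the interval runs from about 1305 to 1795 and contains 1400; with a standard error of 25 it runs from about 1501 to 1599 and does not.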

The testing of hypotheses is seen to be closely related to the problem 
of estimation. It will be instructive, however, to develop the theory 
of testing independently of the theory of estimation, at least in the 
beginning. 

12.2. Test of a Hypothesis against a Single Alternative. In the 
example considered above, there were many alternatives to the null 
hypothesis; the mean of the second population could have been any 
positive number within a fairly wide range. To introduce the basic 
notions of testing hypotheses, we shall consider the very simple case 
of one alternative. Suppose it is known that a population has either
the density f₀(x) or the density f₁(x), and suppose it is desired to test
on the basis of one observation whether the true density is f₀(x) or
f₁(x). Let us designate by

    H₀: the hypothesis that f(x) = f₀(x)

and by

    H₁: the alternative hypothesis that f(x) = f₁(x)

We shall call H₀ the null hypothesis; rejection of H₀ will be equivalent
to acceptance of H₁.

To test H₀, we shall choose a number A (see Fig. 55) and make an
observation x₁. If x₁ < A, we shall accept H₀; if x₁ > A, we shall
reject H₀.

There are two kinds of error possible in this test. We may reject
H₀ when it is in fact true; i.e., the population may have f₀(x) as its
distribution even though the observed x did exceed A. This is called
the Type I error of the test, and for the example of Fig. 55 its
probability is obviously

    $$ \int_A^{\infty} f_0(x)\,dx $$

This probability is often called the significance level of the test. A
second possible error is the acceptance of H₀ when it is false; i.e., the
observation may be less than A even though the true population
distribution is f₁(x). This is called the Type II error of the test, and in
the example we are considering its probability is

    $$ \int_{-\infty}^{A} f_1(x)\,dx $$


The interval x < A is called the acceptance region for the null
hypothesis, and the interval x > A is called the rejection region, or
more often the critical region. The construction of a test is nothing
more than a matter of dividing the x axis into two regions, and this
may be done quite arbitrarily. We might set up a test as follows (see
Fig. 55):

    Accept H₀ if x < a or x > b

    Reject H₀ if a < x < b

[Fig. 55. The densities f₀(x) and f₁(x), with the points a, b, c, A, and d marked on the x axis.]

This is clearly a poorer test than the one described first. We may
make the two tests comparable on one score by making the probabilities
of their Type I errors the same, say .05; i.e., A may be chosen so
that

    $$ \int_A^{\infty} f_0(x)\,dx = .05 $$

and a and b chosen so that

    $$ \int_a^b f_0(x)\,dx = .05 $$

The superiority of the first test is then apparent in the Type II errors,

    $$ \int_{-\infty}^{A} f_1(x)\,dx < \int_{-\infty}^{a} f_1(x)\,dx + \int_b^{\infty} f_1(x)\,dx $$

The second test is much more likely to accept H₀ when it is false.
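
The comparison of the two tests can be made concrete with a numerical sketch. The densities used here are assumptions, since the text leaves f₀ and f₁ unspecified: f₀ is taken to be normal with mean 0 and f₁ normal with mean 2, both with unit variance, and the interval (a, b) of the second test is centered on the mean of f₀.

```python
from statistics import NormalDist

# Hypothetical densities; the text does not specify f0 and f1.
f0 = NormalDist(0, 1)
f1 = NormalDist(2, 1)
alpha = 0.05

# First test: reject when x > A, with area alpha under f0 to the right of A.
A = f0.inv_cdf(1 - alpha)
type2_first = f1.cdf(A)                  # P(x < A) when f1 is true

# Second test: reject when a < x < b, an interval with area alpha under f0.
b = f0.inv_cdf(0.5 + alpha / 2)
a = -b
type1_second = f0.cdf(b) - f0.cdf(a)
type2_second = 1 - (f1.cdf(b) - f1.cdf(a))   # P(x < a or x > b) when f1 is true

print(round(type2_first, 3), round(type2_second, 3))
```

Both tests have the same probability of a Type I error, but under these assumed densities the second test accepts H₀ almost always when H₁ is true.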

A good test is clearly one which makes the probabilities of both errors 
as small as possible. However, it is impossible to reduce both errors 
simultaneously with a single observation. The common procedure is
to fix the Type I error arbitrarily (make it have probability .05, for
example) and then choose the critical region so as to minimize the
probability of a Type II error. The quantity

    1 − probability of a Type II error

is called the power of the test. The power of the first test (based on the
intervals x < A and x > A) we have described is

    $$ 1 - \int_{-\infty}^{A} f_1(x)\,dx = \int_A^{\infty} f_1(x)\,dx $$

In terms of this concept, the principle for setting up a test is to fix the
probability of a Type I error and then choose a critical region so as to 
maximize the power of the test. 

Returning to the example of Fig. 55, we can now set up the best test 
of the null hypothesis for given size of the Type I error. Suppose we 
wish the Type I error to have probability .05. Our problem is to 
divide the x axis into two regions (two intervals or two collections of 
intervals), one of which will be the acceptance region and the other 
the critical region. We may concentrate on the critical region, and 
having selected it, the remainder of the axis will be the acceptance 
region. The critical region is to be such that the area under f₀(x) over
the critical region is .05, and such that the power will be maximized,
i.e., such that the area under f₁(x) will be as large as possible over the
critical region.

Certainly the critical region will include every x to the right of x = d,
the upper limit of the range of f₀(x). We can include still more of the
area under f₁(x) so long as we do not make the area under f₀(x) exceed
.05. The best values of x to choose are obviously those for which
f₁(x) is as large as possible relative to f₀(x). We want f₁(x) to be large
so that the area under f₁(x) will be large, and we want f₀(x) to be small
so that as much of the area under f₁(x) can be included as possible
without taking in more than .05 of the area under f₀(x). The best
critical region is clearly the interval x > A where A is chosen so that

    $$ \int_A^{\infty} f_0(x)\,dx = .05 $$


Other best tests would be determined by changing the specification
of the probability of the Type I error. In the present illustration, for
example, the Type I error could be made zero, and the best critical
region would be x > d. This is the test one would make if he were
particularly anxious to avoid rejecting H₀ when it was true, but was
not greatly concerned about rejecting H₁ when it was true. To refer
back to the light-bulb illustration, Ho may refer to the standard manu- 
facturing process. One would not want to go to the expense of chang- 
ing the process unless he was rather certain the new process was 
superior. Of course, such a decision as this would not ordinarily be 
based on one observation in practice. 

The general method of setting up critical regions in the case of one 
alternative is quite simple. Suppose we are testing H₀ against H₁ as
before. The inequality

    $$ \frac{f_1(x)}{f_0(x)} > k \qquad (1) $$

where k is an arbitrarily chosen number, will be satisfied by certain
values of x. These values of x form a critical region for a best test, the
test for which the Type I error is given by integrating f₀(x) over the
region. Thus in Fig. 55 if we choose

    $$ k = \frac{f_1(A)}{f_0(A)} $$

the set of values of x for which (1) is satisfied is just the set x > A. By
reducing k, we would get another set of x values, x > A′, where A′
would be some number to the left of A. The test would be more
powerful (would have greater probability of accepting H₁ when it was
true) but would have larger probability of a Type I error. By changing
k, the probability of a Type I error may be made to have any
desired value. A general criterion for constructing tests may be 
stated thus: 

To set up a best test for a given probability α of a Type I error, one
chooses as a critical region the set R of points x such that:

    $$ f_1(x) > k f_0(x) $$

where k is selected so that:

    $$ \int_R f_0(x)\,dx = \alpha $$


This criterion refers to a test for a single alternative H₁ and a single
observation. It is almost obvious that the given method of choosing R
will maximize the power of the test. A formal proof would go somewhat
as follows: Consider the possibility of replacing a small interval
Δx′ about a point x′ in R by an interval Δx″ about a point x″ not in R.
(We may think of R as the interval x > A in Fig. 55.) Let the
length of Δx″ be so chosen that the probability of the Type I error is
unchanged by the replacement, i.e., so that approximately

    $$ f_0(x'')\,\Delta x'' = f_0(x')\,\Delta x' $$

The substitution will decrease the power by about f₁(x′)Δx′ and
increase it by about f₁(x″)Δx″. Since x′ is in R,

    $$ f_1(x')\,\Delta x' > k f_0(x')\,\Delta x' $$

and since x″ is not in R,

    $$ f_1(x'')\,\Delta x'' < k f_0(x'')\,\Delta x'' $$

The right-hand sides of these last two expressions are equal, however;
hence

    $$ f_1(x'')\,\Delta x'' < f_1(x')\,\Delta x' $$

and any such replacement would necessarily reduce the power of the
test.
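
The criterion may be checked by direct enumeration in a discrete analogue, where sums take the place of the integrals. The two distributions below are hypothetical and serve only as an illustration; the sketch forms the region R for one value of k and then searches every other region whose Type I error does not exceed that of R.

```python
from itertools import combinations

# Hypothetical discrete distributions on eight points (not from the text).
points = range(8)
f0 = [0.30, 0.25, 0.20, 0.10, 0.05, 0.05, 0.03, 0.02]
f1 = [0.02, 0.05, 0.08, 0.10, 0.15, 0.10, 0.20, 0.30]

k = 2.5
R = [x for x in points if f1[x] > k * f0[x]]   # region given by the criterion
alpha = sum(f0[x] for x in R)                  # probability of a Type I error
power_R = sum(f1[x] for x in R)                # power of the test

# Largest power attainable by any region with Type I error at most alpha.
best_power = 0.0
for size in range(len(f0) + 1):
    for region in combinations(points, size):
        if sum(f0[x] for x in region) <= alpha + 1e-12:
            best_power = max(best_power, sum(f1[x] for x in region))

print(R, round(alpha, 2), round(power_R, 2), round(best_power, 2))
```

No competing region does better than R, in agreement with the replacement argument given above.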


[Fig. 56. Two densities plotted against x, with the points a, a′, b′, and b marked on the x axis.]


To illustrate the method further, we may consider the situation in 
Fig. 56. A critical region for k = 34 is given by the interval

    a < x < b

The corresponding acceptance region is, of course, the pair of
intervals x < a and x > b. The test has fairly high power in that H₀ will
often be rejected when H₁ is true, but its Type I error is large. If we
choose a test with small probability, say .05, of a Type I error, then
the critical region would become a′ < x < b′, and the null hypothesis
would be accepted 95 per cent of the time when it was true. But now
the power of the test is small; H₀ will not often be rejected when it is
false, i.e., when H₁ is true. The power is, however, as large as it can
be made for the given probability of a Type I error. This situation 
can be improved by taking more observations; we have been consider- 
ing only tests based upon a single observation. 

When a test is to be based on a sample of several observations, the
construction is essentially the same as that we have already examined.
Suppose a sample of two observations is to be used to test H₀ against
H₁. The sample density is

    $$ f(x_1)\,f(x_2) $$

defined over the x₁, x₂ plane. A test is defined by selecting a critical
region R in the plane, accepting H₀ if the sample point (x₁, x₂) falls
outside R, and rejecting H₀ if the sample point falls inside R. Here
again the best test is given by selecting R to be the set of points (x₁, x₂)
such that

    $$ \frac{f_1(x_1)\,f_1(x_2)}{f_0(x_1)\,f_0(x_2)} > k $$

[Fig. 57. The critical region R in the x₁, x₂ plane.]

The probability of a Type I error is

    $$ \iint_R f_0(x_1)\,f_0(x_2)\,dx_1\,dx_2 $$

and for that probability the power of the test

    $$ \iint_R f_1(x_1)\,f_1(x_2)\,dx_1\,dx_2 $$

is maximized.

The generalization to samples of size n is immediate. The sample 
observations (x₁, x₂, ..., xₙ) may be plotted as a point in an
n-dimensional space. The space is divided into two regions: the critical
region R and the acceptance region. If the sample point falls in R,
H₀ is rejected; otherwise H₀ is accepted. The best critical region R
will consist of those points (x₁, x₂, ..., xₙ) in the n-dimensional
space for which the likelihood ratio

    $$ \frac{f_1(x_1)\,f_1(x_2) \cdots f_1(x_n)}{f_0(x_1)\,f_0(x_2) \cdots f_0(x_n)} $$

exceeds some number k, and k is so chosen that the test has the desired
probability of a Type I error. This probability is, of course,

    $$ \int\!\!\int \cdots \int_R f_0(x_1)\,f_0(x_2) \cdots f_0(x_n)\,dx_1\,dx_2 \cdots dx_n $$

We shall not actually have to deal with n-dimensional spaces because 
we shall be concerned with tests of parameter values and such tests 
can often be based on the distribution of an estimator of the parameter. 

12.3. Tests for Several Alternative Hypotheses. A common problem
in testing hypotheses is that of testing a particular parameter
value, say θ₀, against a set of other values of θ for a family of
distributions f(x; θ). The basic ideas may be illustrated by a particular
example. Suppose a population is known to have a normal distribution
with σ² = 1, and suppose it is further known that the mean μ is
greater than or equal to some given number μ₀. On the basis of an
observation x, we shall test the null hypothesis,

    $$ H_0: \mu = \mu_0 \qquad (1) $$

[Fig. 58. Densities f(x) plotted against x, with μ₀, A, and μ₁ marked on the x axis.]

The alternatives to this hypothesis are all the values μ > μ₀. On the
basis of an observation x, we shall accept H₀ (state that μ = μ₀) or
reject H₀ (state that μ > μ₀). We shall require a test for which the
probability of a Type I error is, say, .05.

If a particular value μ′ of μ is considered, the best test of μ₀ against
that value is given by choosing as a critical region the set of points for
which

    $$ f(x; \mu') > k f(x; \mu_0) \qquad (2) $$

or, using the specific form of the distribution,

    $$ \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)(x-\mu')^2} > k\,\frac{1}{\sqrt{2\pi}}\, e^{-(1/2)(x-\mu_0)^2} \qquad (3) $$

After canceling 1/√(2π) and taking logarithms, this inequality may be
put in the form

    $$ x > \frac{2\log k + {\mu'}^2 - \mu_0^2}{2(\mu' - \mu_0)} \qquad (4) $$

The best critical region is, therefore, an interval x > A, and A is to be
chosen so that

    $$ \int_A^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)(x-\mu_0)^2}\,dx = .05 \qquad (5) $$

The value of A is determined from the normal tables to be

    $$ A = \mu_0 + 1.65 $$

in the present example. It was, of course, to be expected that the
critical region would be of the form x > A.

An important thing to observe here is that the critical region is
independent of the selected value μ′. Any value of μ greater than μ₀
would have given rise to exactly the same critical region. For we
should have found that the best critical region was of the form x > A
regardless of the value given μ′, and the determination of the value of A
depends only on μ₀ and the selected probability of a Type I error.
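
That the critical region does not depend on μ′ can be checked numerically. In this sketch the value μ₀ = 10 is hypothetical, and `k_for` and `in_region` are illustrative names; `k_for` is obtained by setting the right side of (4) equal to A and solving for k.

```python
import math
from statistics import NormalDist

mu0 = 10.0                                   # hypothetical null value
A = mu0 + NormalDist().inv_cdf(0.95)         # about mu0 + 1.65

def k_for(mu_alt):
    """The k for which inequality (4) reads x > A, for the alternative mu_alt."""
    return math.exp((2 * A * (mu_alt - mu0) - mu_alt ** 2 + mu0 ** 2) / 2)

def in_region(x, mu_alt):
    """Inequality (2) checked directly from the two normal densities."""
    f_alt = NormalDist(mu_alt, 1).pdf(x)
    f_null = NormalDist(mu0, 1).pdf(x)
    return f_alt > k_for(mu_alt) * f_null

# Points just above and just below A, tried against several alternatives.
for mu_alt in (10.5, 12.0, 15.0):
    print(mu_alt, in_region(A + 0.01, mu_alt), in_region(A - 0.01, mu_alt))
```

The value of k changes with the alternative, but the set of x satisfying (2) is the interval x > A in every case.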

We shall see later that this is not a general situation. It is not in 
general true that the inequality 


    $$ f(x; \theta) > k f(x; \theta_0) $$

will give rise to the same critical region for all possible values of θ
alternative to a value θ₀ specified by a null hypothesis. When it is
true that all alternatives give rise to the same critical region, the test 
is called a uniformly most powerful test. We shall see that uniformly 
most powerful tests do exist for many important problems in statistics, 
while there are other equally important problems which do not have 
uniformly most powerful tests. 

Going back to the problem of testing μ₀ against all μ > μ₀, let us
consider the power of the test for a particular value of μ. The power
is the probability of rejecting H₀ when it is false (when the true mean
is μ > μ₀) and is given by

    $$ \int_A^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)(x-\mu)^2}\,dx $$

This quantity will clearly be a function of μ; it will be denoted by P(μ)
and will be called the power function of the test. When the true mean
μ is far to the right of μ₀, the power will be nearly one, while when μ
is near μ₀, the power will be small; at μ = μ₀ the power becomes equal
to the Type I error, the probability that x falls in the critical region
when μ = μ₀. The function is plotted in Fig. 59.
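
The power function of this test is readily computed. The sketch below assumes μ₀ = 0 for illustration (any other value merely shifts the curve), and the name `power` is not from the text.

```python
from statistics import NormalDist

mu0 = 0.0                                # hypothetical null value
A = mu0 + NormalDist().inv_cdf(0.95)     # critical point for a .05 Type I error

def power(mu):
    """P(mu): probability that x exceeds A when x is normal with
    mean mu and unit variance."""
    return 1 - NormalDist(mu, 1).cdf(A)

for mu in (0.0, 0.5, 1.0, 2.0, 3.0, 4.0):
    print(mu, round(power(mu), 3))
```

At μ = μ₀ the function equals .05, the probability of a Type I error, and it rises toward one as μ moves to the right.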

In view of the fact that the test we are considering is a uniformly
most powerful test, we can make the following statement about its
power function: the power function of any other test with the same
probability of a Type I error will lie entirely below the curve of Fig. 59
(except, of course, that it will have the same value at μ₀). The general
problem of studying tests can be set up in terms of the power function.
For one parameter we may consider the test of the null hypothesis:

    $$ H_0: \theta = \theta_0 $$

for the parameter of a density f(x; θ), where the possible values of θ
lie in some interval which may be finite or infinite. In Fig. 60 are
plotted several power functions for fixed Type I error. If a test exists
which has a power function such as P₁(θ), then we have a very fine
test indeed, and it can be shown that such tests can be obtained in
general for large samples. For small samples, power functions are
more likely to look like P₂(θ) and P₃(θ). And generally speaking,
there will be no absolute criterion for choosing between tests. The
test represented by P₂(θ) is better than that represented by P₃(θ) for
θ > θ₀ and for a < θ < b. But the test represented by P₃(θ) is better
for θ < a and b < θ < θ₀.

[Fig. 59. The power function P(μ) plotted against μ; the vertical axis is marked at 0.05 and 1.0.]

The situation just described is typical. It will be possible to set up 
tests which are best for certain alternatives to Ho but which are poor 
for other alternatives, and other tests will be better for these other
alternatives. The choice of a test must depend on the particular
problem at hand and on the end one is most anxious to gain by the
test. Thus, for example, if one had to make a choice between the two
tests represented by P₂(θ) and P₃(θ) in Fig. 60, he would choose P₃(θ)
if he wanted to be fairly certain to reject H₀ when θ was quite far from
θ₀ in either direction. But P₂(θ) would be chosen if he were
particularly concerned with the alternatives θ > θ₀.

We may mention here that an unbiased test is one such that its
power function has a minimum at θ = θ₀. The test represented by
P₂(θ) is biased. There are values of θ (just to the left of θ₀) for
which the probability 1 − P(θ) of accepting the null hypothesis is
larger than for the null hypothesis itself.


[Fig. 60. Power functions P(θ) plotted against θ; the vertical axis is marked at 1.0 and the θ axis at a, b, and θ₀.]


12.4. Simple and Composite Hypotheses. We turn now to hypotheses
involving distributions with several parameters, and we may
consider the general density f(x; θ₁, θ₂, ..., θₖ). The distribution
may have several variates x, y, z, ... without in any way changing
the ensuing development. The parameter space with coordinates
θ₁, θ₂, ..., θₖ will be denoted by the Greek letter Ω. A particular
distribution in the family of distributions will be represented by a
point in Ω. Thus if numerical values θ₁₀, θ₂₀, ..., θₖ₀ are substituted
for θ₁, θ₂, ..., θₖ in f(x; θ₁, θ₂, ..., θₖ), a specific distribution
function is determined. The numerical values (θ₁₀, θ₂₀, ..., θₖ₀)
may be thought of as the coordinates of a point in a k-dimensional
space. Thus the family of normal distributions with

    $$ f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2} $$

may be represented by the upper half plane of Fig. 61. The coordinates
of any point in the plane determine a particular member of the
family. This upper half plane is Ω for the given family.

[Fig. 61. The parameter space Ω for the normal family, with the point (6, 4) and the curve μ = −3 + σ²/2 marked.]

A simple null hypothesis is one which states that a distribution is one
specific member of a given family. A composite null hypothesis is
one which states that a distribution belongs to some subspace of the
parameter space. We shall be primarily interested in subspaces of
lower dimensionality than that of Ω. Referring to the two-parameter
family of normal distributions, the null hypothesis:

    $$ H_0: \mu = 6, \quad \sigma = 2 $$

is a simple hypothesis because it completely specifies a single
distribution in the family. The null hypothesis:

    $$ H_0: \mu = -5 $$

is satisfied by all distributions with mean −5 regardless of the value of
σ²; this null hypothesis selects a subspace (the line μ = −5) of the
parameter space and is a composite hypothesis. Similarly

    $$ H_0: \mu = -3 + \frac{\sigma^2}{2} $$


is a composite hypothesis satisfied by all distributions with parameter 
values which satisfy the given relation. 

Of course a simple hypothesis which selects a single point of the 
parameter space may be regarded as a special case of a composite 
hypothesis, because a point is a subspace; we shall use the word com- 
posite only when the subspace has more than one point. The symbol 
ω will be used to designate the subspace determined by the null
hypothesis whether it is simple or composite.

For a general family of distributions f(x; θ₁, θ₂, ..., θₖ), a null
hypothesis will state that the actual distribution belongs to some
subspace ω of the complete parameter space Ω. If ω is a point, the
hypothesis is simple; otherwise the hypothesis is composite.

12.5. The Likelihood-ratio Test and Its Large-sample Distribution. 
There are many ways to set up tests of hypotheses, and the best test 
in any given situation depends on the form of the distribution function 
and what alternatives are considered to be of primary importance. 
We shall not be able to study all the various methods of constructing 
tests but shall confine our attention to one method which usually leads 
to a very good test.

The likelihood-ratio test is closely related to maximum-likelihood
estimation and to the ratio test described in Sec. 2 for a single
alternative. Let x₁, x₂, ..., xₙ be a sample of size n from a population
with density f(x; θ₁, θ₂, ..., θₖ). On the basis of this sample it is
desired to test the null hypothesis:

    H₀: f(x; θ₁, θ₂, ..., θₖ) belongs to the subspace ω of Ω

The likelihood of the sample is

    $$ L = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \cdots, \theta_k) \qquad (1) $$

The likelihood as a function of the parameters will ordinarily have a
maximum as the parameters are allowed to vary over the entire
parameter space Ω; we shall denote this maximum value by
L(θ̂₁, θ̂₂, ..., θ̂ₖ) or more briefly by L(Ω̂). In the subspace ω, L will
also have a maximum value which we shall denote by L(ω̂). The
likelihood ratio is the quotient of these two maxima and is denoted by

    $$ \lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})} \qquad (2) $$

This quantity is necessarily a positive fraction; L is positive because
it is a product of density functions, and L(ω̂) will be smaller than or at
most equal to L(Ω̂) because there is less freedom for maximizing L in ω
than in Ω. The ratio λ is a function of the sample observations only;
it does not involve any parameters. The range of the variate λ is zero
to one.

An illustration will reveal the logic of using λ as a test criterion.
Let the family of distributions be the one-parameter family of normal
distributions with unit variance, and let the sample consist of n
observations x₁, x₂, ..., xₙ. We shall test the null hypothesis

    $$ H_0: \mu = 3 $$

that the population mean is actually three. This point is ω while the
whole μ axis is Ω. The likelihood is

    $$ L = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} e^{-(1/2)\sum (x_i - \mu)^2} $$

which may be written

    $$ L = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} e^{-(1/2)\sum (x_i - \bar{x})^2 - (n/2)(\bar{x} - \mu)^2} $$

The maximum value of this quantity in Ω is, of course, given by putting
μ = x̄ to obtain

    $$ L(\hat{\Omega}) = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} e^{-(1/2)\sum (x_i - \bar{x})^2} $$

Since in this example ω is a point (the null hypothesis is simple), there
is no opportunity to vary μ and the largest value of L in ω is simply its
only value:

    $$ L(\hat{\omega}) = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} e^{-(1/2)\sum (x_i - \bar{x})^2 - (n/2)(\bar{x} - 3)^2} $$

The likelihood ratio is then

    $$ \lambda = e^{-(n/2)(\bar{x} - 3)^2} $$

If x̄ happens to be quite near 3, then the sample is reasonably
consistent with H₀, and λ will be near one. If x̄ is much greater than or
less than 3, the sample will not be consistent with H₀ and λ will be
near zero.

Clearly the proper critical region for testing H₀ is an interval

    $$ 0 < \lambda < A $$

where A is some number (less than one) chosen to give the desired
control of the Type I error.

This example illustrates the general situation. If the maximum-likelihood
estimates fall in or near $\omega$, the sample will be considered
consistent with $H_0$ and the ratio $\lambda$ will be near one. If the estimate
$(\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_k)$ is distant from $\omega$,
then the sample will not be in accord with $H_0$ and $\lambda$ will ordinarily
be small. The critical region for $\lambda$ will always be an interval of the
form $0 < \lambda < A$. The number $A$ will be determined by the distribution
of $\lambda$ and the desired probability of a Type I error. If that
probability is to be .05, for example, and if the density of $\lambda$ is
$g(\lambda)$ when $H_0$ is true, then $A$ is the number for which

$$\int_0^A g(\lambda)\, d\lambda = .05$$

In order to prescribe the critical region for $\lambda$, it is necessary to know
the distribution of $\lambda$ when $H_0$ is true. If $H_0$ is a simple hypothesis
[$\omega$ is a point $(\theta_{10}, \theta_{20}, \cdots, \theta_{k0})$, for example],
then there will be a unique distribution determined for $\lambda$. But if
$H_0$ is composite, there may or may not be a unique distribution for
$\lambda$. It is quite possible that the distribution of $\lambda$ may be different
for different parameter points in $\omega$, and in this case $A$ will not be
uniquely determined. To specify a test, it is necessary to add further
arbitrary criteria into the method of constructing the test. We shall
not investigate these problems; we merely wish to observe here that the
likelihood-ratio method as far as we have described it does not always
lead to a unique test.

As is usually the case, a very satisfactory solution to the problem of
testing hypotheses exists when one is dealing with large samples. The
solution is based on a theorem which we shall not be able to prove
because of the advanced character of its proof:

If a density function $f(x; \theta_1, \theta_2, \cdots, \theta_k)$ satisfies
conditions like those enumerated in Sec. 10.8, if the dimensionality of
$\Omega$ is $k$, and if the dimensionality of $\omega$ is $r < k$, then
$-2 \log \lambda$ is approximately distributed like chi square with $k - r$
degrees of freedom for large samples when $H_0$ is true.

Since $-2 \log \lambda$ increases as $\lambda$ decreases and approaches infinity as
$\lambda$ approaches zero, the critical region for $-2 \log \lambda$ is the
right-hand tail of the chi-square distribution. Therefore if we are
dealing with a large sample and wish to test a null hypothesis with
probability .05 for a Type I error, for example, it is only necessary to
compute $-2 \log \lambda$ and compare it with the .05 level of chi square; if
$-2 \log \lambda$ exceeds the chi-square level, $H_0$ would be rejected;
otherwise $H_0$ would be accepted.
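The procedure just described can be sketched for the unit-variance illustration of this section, where $\lambda = e^{-(n/2)(\bar{x} - 3)^2}$ and hence $-2 \log \lambda = n(\bar{x} - 3)^2$. Here $k = 1$ and $r = 0$, so the statistic is referred to chi square with one degree of freedom. The function names, the two samples, and the hard-coded table value 3.841 are assumptions made only for this illustration.

```python
# Upper .05 point of chi square with 1 degree of freedom, taken from a
# standard table (an assumed constant, not computed here).
CHI2_1DF_05 = 3.841


def neg2_log_lambda(xs, mu0=3.0):
    """-2 log(lambda) for H0: mu = mu0 in a normal population with unit variance."""
    n = len(xs)
    xbar = sum(xs) / n
    # lambda = exp(-(n/2)(xbar - mu0)^2), so -2 log(lambda) = n (xbar - mu0)^2
    return n * (xbar - mu0) ** 2


def reject_h0(xs, mu0=3.0, critical=CHI2_1DF_05):
    """Reject when the statistic falls in the right-hand tail."""
    return neg2_log_lambda(xs, mu0) > critical


# Hypothetical samples: one centered on 3, one centered on 4.
sample_near = [3.1, 2.9, 3.0, 3.2, 2.8]
sample_far = [4.0, 4.0, 4.0, 4.0]
print(neg2_log_lambda(sample_near), reject_h0(sample_near))
print(neg2_log_lambda(sample_far), reject_h0(sample_far))
```

In this particular example the chi-square distribution holds for every $n$, not only for large samples, because $\sqrt{n}(\bar{x} - 3)$ is exactly a standard normal variate under $H_0$.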

12.6. Tests on the Mean of a Normal Population. The foregoing
ideas are well illustrated by a very common practical problem, that
of testing whether the mean of a normal population has a specified
value. We shall suppose that we have a sample of $n$ observations,
$x_1, x_2, \cdots, x_n$, from a normal population with mean $\mu$ and variance
$\sigma^2$. We wish to test the null hypothesis:

$$H_0: \mu = \mu_0 \tag{1}$$

where $\mu_0$ is a given number. The parameter space $\Omega$ is the half plane
of Fig. 61. The subspace $\omega$ characterized by the null hypothesis is the
vertical line $\mu = \mu_0$.

We shall test $H_0$ by means of the likelihood ratio. The likelihood is

$$L = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2}\sum [(x_i - \mu)/\sigma]^2} \tag{2}$$

We have already seen that the values of $\mu$ and $\sigma^2$ which maximize $L$
in $\Omega$ are

$$\hat{\mu} = \bar{x} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \bar{x})^2$$

Substituting these values in $L$, we have

$$L(\hat{\Omega}) = \left[\frac{1}{(2\pi/n)\sum (x_i - \bar{x})^2}\right]^{n/2} e^{-n/2} \tag{3}$$

To maximize $L$ in $\omega$, we put $\mu = \mu_0$, and the only remaining parameter
is $\sigma^2$; the value of $\sigma^2$ which then maximizes $L$ is readily found to be

$$\hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \mu_0)^2$$

which gives

$$L(\hat{\omega}) = \left[\frac{1}{(2\pi/n)\sum (x_i - \mu_0)^2}\right]^{n/2} e^{-n/2} \tag{4}$$

The ratio of (4) to (3) is the likelihood ratio:

$$\lambda = \left[\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \mu_0)^2}\right]^{n/2} \tag{5}$$

Our next step is to obtain the distribution of $\lambda$ under $H_0$ and use that
distribution to determine a number $A$ so that the critical region
$0 < \lambda < A$ will give the desired probability, .05, for example, of
rejecting $H_0$ when it is true.

It happens that the distribution of $\lambda$ is easily obtained in this case.
The sum of squares in the denominator of (5) may be put in the form

$$\sum (x_i - \mu_0)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2$$

so that $\lambda$ may be written

$$\lambda = \left\{\frac{1}{1 + [n(\bar{x} - \mu_0)^2 / \sum (x_i - \bar{x})^2]}\right\}^{n/2}$$

We may recall (Sec. 11.2) that the fraction in the denominator is just

$$\frac{t^2}{n - 1} \tag{6}$$

where $t$ has the $t$ distribution with $n - 1$ degrees of freedom when $H_0$
is true. To obtain the distribution of $\lambda$, we need merely to transform
the $t$ distribution by the substitution

$$\lambda = \left[\frac{1}{1 + t^2/(n - 1)}\right]^{n/2} \tag{7}$$

It is not necessary actually to obtain the distribution of $\lambda$, because
it is a monotonic function of $t^2$ and the test can be done just as well
with $t^2$ as a criterion as with $\lambda$. Since $t^2 = 0$ when $\lambda = 1$ and $t^2$
becomes infinite when $\lambda$ approaches zero, a critical region of the form
$0 < \lambda < A$ is equivalent to a critical region $t^2 > B$ where $B$ may be
determined from $A$ by equation (7). The critical values of $t$ are
therefore the extreme values either positive or negative, and a .05
critical region for $t$ is the pair of intervals

$$t < -t_{.05} \quad \text{and} \quad t > t_{.05}$$

where $t_{.05}$ is the number for which

$$\int_{t_{.05}}^{\infty} f(t; n - 1)\, dt = .025 \tag{8}$$

$f(t; n - 1)$ representing the $t$ distribution with $n - 1$ degrees of
freedom. The test of $H_0$ may therefore be performed as follows:
We compute the quantity

$$t = \frac{\sqrt{n(n - 1)}\,(\bar{x} - \mu_0)}{\sqrt{\sum (x_i - \bar{x})^2}} \tag{9}$$

If it lies between $-t_{.05}$ and $t_{.05}$, $H_0$ is accepted; otherwise $H_0$ is
rejected.

It is worth while to observe the connection between this test and the
confidence-interval estimate of the mean. Supposing the mean of
the population sampled to be $\mu'$, a 95 per cent confidence interval for
$\mu'$ is just the set of values $\mu$ for which

$$-t_{.05} < \frac{\sqrt{n(n - 1)}\,(\bar{x} - \mu)}{\sqrt{\sum (x_i - \bar{x})^2}} < t_{.05} \tag{10}$$

Hence the test of $H_0$ is equivalent to the following test: Construct
a confidence interval for the population mean. If $\mu_0$ lies in the
confidence interval, accept $H_0$; if $\mu_0$ does not lie in the confidence
interval, reject $H_0$.
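The equivalence between the two-tailed $t$ test (9) and the confidence interval (10) can be checked numerically. The following is a minimal sketch: the ten observations are invented for the illustration, the function names are hypothetical, and the value 2.262 is the tabulated $t_{.05}$ for 9 degrees of freedom, entered as a constant rather than computed.

```python
import math

# Tabulated t_.05 for 9 degrees of freedom (two tails together hold .05).
T_05_9DF = 2.262


def summary(xs):
    """Return n, the sample mean, and the sum of squared deviations."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    return n, xbar, ss


def t_statistic(xs, mu0):
    """Equation (9)."""
    n, xbar, ss = summary(xs)
    return math.sqrt(n * (n - 1)) * (xbar - mu0) / math.sqrt(ss)


def confidence_interval(xs, t_crit):
    """The set of mu satisfying (10), solved for its end points."""
    n, xbar, ss = summary(xs)
    half_width = t_crit * math.sqrt(ss / (n * (n - 1)))
    return xbar - half_width, xbar + half_width


def accept_by_t(xs, mu0, t_crit):
    return abs(t_statistic(xs, mu0)) < t_crit


def accept_by_interval(xs, mu0, t_crit):
    low, high = confidence_interval(xs, t_crit)
    return low < mu0 < high


# Hypothetical bulb lives in hours (made-up data).
lives = [1385, 1412, 1398, 1421, 1407, 1390, 1415, 1402, 1396, 1424]
for mu0 in (1380, 1390, 1400, 1410, 1420):
    print(mu0, accept_by_t(lives, mu0, T_05_9DF),
          accept_by_interval(lives, mu0, T_05_9DF))
```

Both decision rules print the same verdict for every hypothesized mean, as the argument above requires.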

We may also observe that the theorem at the end of the preceding
section gives the correct distribution of $\lambda$ for large samples. Since

$$-2 \log \lambda = n \log \left(1 + \frac{t^2}{n - 1}\right) \tag{11}$$

and since the series expansion of $\log (1 + x)$ is

$$\log (1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots \qquad \text{for } -1 < x < 1 \tag{12}$$

we have

$$-2 \log \lambda = \frac{n}{n - 1}\, t^2 - \frac{n}{2(n - 1)^2}\, t^4 + \frac{n}{3(n - 1)^3}\, t^6 - \cdots$$

for any fixed value of $t$, however large, provided $n$ is taken large enough
to make $t^2/(n - 1)$ less than one. The first term of this series is
essentially $t^2$, and the others approach zero as $n$ becomes large. Hence
for large $n$, $-2 \log \lambda$ is approximately $t^2$. Furthermore $t$ is
approximately normally distributed for large samples (Sec. 11.7) with zero
mean and unit variance if the true mean is $\mu_0$; hence $t^2$ has
approximately the chi-square distribution with one degree of freedom. This
is in accord with the theorem, since $\Omega$ is a plane and has $k = 2$
dimensions, while $\omega$ is a line and has $r = 1$ dimension.

FIG. 62. [Densities $f(t; \mu_0)$ and $f(t; \mu_1)$ plotted against $t$.]
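The convergence of $-2 \log \lambda$ to $t^2$ can be seen directly from equation (11). A small sketch, with the function name and the chosen values of $n$ and $t$ being arbitrary choices for the illustration:

```python
import math


def neg2_log_lambda_from_t(t, n):
    """Equation (11): -2 log(lambda) = n log(1 + t^2 / (n - 1))."""
    return n * math.log(1.0 + t * t / (n - 1))


# Hold t fixed at 2 (so t^2 = 4) and let the sample size grow.
for n in (5, 50, 500, 5000):
    print(n, neg2_log_lambda_from_t(2.0, n))
```

For small $n$ the value falls noticeably short of $t^2 = 4$; as $n$ grows the printed values settle toward 4.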

One-tailed Tests on the Mean. The test we have just constructed is
called the two-tailed test of the mean, referring to the fact that the
critical region is composed of both extremes of the $t$ distribution. The
test is not a uniformly most powerful test, and in fact there is no
uniformly most powerful test for the given null hypothesis. If we
consider a single one of the alternatives to $H_0$, $\mu = \mu_1$, for example,
where $\mu_1 > \mu_0$, the two $t$ distributions are represented in Fig. 62. The
best critical region for $t$, given a .05 probability of a Type I error, is
obviously the interval $t > t_{.10}$, which cuts off 5 per cent of the area
under $f(t; \mu_0)$ on the right-hand tail. This will be the best critical
region for any value of $\mu$ greater than $\mu_0$. The power $P_1(\mu)$ of this test
is plotted in Fig. 63 together with the power $P_2(\mu)$ of the two-tailed
test. The one-tailed test is certainly better than the two-tailed test
for alternatives $\mu > \mu_0$, and it is a uniformly most powerful test for
those alternatives. But for alternatives $\mu < \mu_0$, the one-tailed test
is no good at all; the power (probability of rejecting $\mu_0$ when $\mu$ is the
true mean) approaches zero as $\mu$ moves away from $\mu_0$ towards the left.
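The contrast between the two power curves can be sketched numerically under a simplifying assumption that is not the text's setting: the variance is treated as known (or the sample as large), so that the test statistic is normal rather than $t$. With $\delta = \sqrt{n}(\mu - \mu_0)/\sigma$ the shift of the statistic, the two powers are then simple normal tail areas. The function names and the symbol $\delta$ are introduced only for this sketch.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def power_one_tailed(delta, alpha=0.05):
    """Power of the right-hand one-tailed test when the statistic has mean delta."""
    z = STANDARD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STANDARD_NORMAL.cdf(z - delta)


def power_two_tailed(delta, alpha=0.05):
    """Power of the two-tailed test when the statistic has mean delta."""
    z = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STANDARD_NORMAL.cdf(z - delta)) + STANDARD_NORMAL.cdf(-z - delta)


for delta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(delta, power_one_tailed(delta), power_two_tailed(delta))
```

At $\delta = 0$ both powers equal the Type I error probability; to the right the one-tailed power is the larger, and to the left it drops below .05 toward zero while the two-tailed power rises.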

There are many practical situations in which the one-tailed test
should be employed. We may refer again to the light-bulb example
used earlier in which the standard manufacturing process produced
bulbs with a mean life of about 1400 hours. Any proposed new process
is of interest only if it produces bulbs with a greater mean life. One
would test the null hypothesis $\mu = 1400$, and use the one-tailed test.
Certainly no harm would be done by accepting $\mu = 1400$ if in fact $\mu$
were less than 1400, because the proposed process would simply be
abandoned in either case. In other problems, the left-hand one-tailed
test might be the appropriate test. For example, a new process might
be thought to reduce the mean production cost per unit; one would
test the null hypothesis that the mean cost $\theta$ for the new process was
equal to the mean cost $\theta_0$ for the standard process against the
alternatives $\theta < \theta_0$. If one were comparing two proposed processes and
wanted to choose the better for further research and development,
then the two-tailed test would be appropriate.

FIG. 63. [Power curves $P_1(\mu)$ and $P_2(\mu)$.]

12.7. The Difference between Means of Two Normal Populations. 
In many situations it is necessary to compare two means when neither 
is known; in the preceding section we assumed one was known. If, for
example, one wished to compare two proposed new processes for
manufacturing light bulbs, he would have to base the comparison on
estimates of both process means. In comparing the yield of a new line
of hybrid corn with that of a standard line, one would also have to use 
estimates of both mean yields because it is impossible to state the mean 
yield of the standard line for the given weather conditions under which 
the new line will be grown. It is necessary to compare the two lines 
by planting them in the same season and on the same soil type, and 
thereby obtain estimates of the mean yields for both lines under sim- 
ilar conditions. Of course the comparison is thus specialized; a com- 
plete comparison of the two lines would require tests over a period of 
years on a variety of soil types. 

The general problem is this: We have two normal populations,
one with variate $x_1$ which has mean $\mu_1$ and variance $\sigma_1^2$, and one with
variate $x_2$ which has mean $\mu_2$ and variance $\sigma_2^2$. On the basis of two
samples, one from each population, we wish to test the null hypothesis:

$$H_0: \mu_1 = \mu_2$$

The parameter space $\Omega$ here is four-dimensional; a joint distribution of
$x_1$ and $x_2$ is specified when values are assigned to the four quantities
$(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$. The subspace $\omega$ is three-dimensional
because values for only three quantities $(\mu, \sigma_1^2, \sigma_2^2)$ need be
specified in order to specify completely the joint distribution under the
hypothesis that $\mu_1 = \mu_2$.

We shall suppose that there are $m$ observations $(x_{11}, x_{12}, \cdots, x_{1m})$
in the sample from the first population and $n$ observations
$(x_{21}, x_{22}, \cdots, x_{2n})$ from the second. The likelihood is

$$L = \left(\frac{1}{2\pi\sigma_1^2}\right)^{m/2} e^{-\frac{1}{2}\sum_i [(x_{1i} - \mu_1)/\sigma_1]^2} \left(\frac{1}{2\pi\sigma_2^2}\right)^{n/2} e^{-\frac{1}{2}\sum_j [(x_{2j} - \mu_2)/\sigma_2]^2} \tag{1}$$

and its maximum in $\Omega$ is readily seen to be

$$L(\hat{\Omega}) = \left[\frac{m}{2\pi \sum_i (x_{1i} - \bar{x}_1)^2}\right]^{m/2} \left[\frac{n}{2\pi \sum_j (x_{2j} - \bar{x}_2)^2}\right]^{n/2} e^{-(m+n)/2} \tag{2}$$


If we put $\mu_1$ and $\mu_2$ equal to $\mu$, say, and try to maximize $L$ with
respect to $\mu$, $\sigma_1^2$, and $\sigma_2^2$, it will be found that the estimate of
$\mu$ is given as the root of a cubic equation and will be a very complex
function of the observations. The resulting likelihood ratio $\lambda$ will
therefore be a complicated function, and to find its distribution would
be a tedious task indeed. No one has, in fact, worked out this
distribution, and there is not much incentive to do so because the
distribution would very likely involve the ratio of the two variances.
If it did involve this ratio, then it would be impossible to determine a
critical region $0 < \lambda < A$ for given probability of a Type I error,
because the ratio of the population variances is ordinarily unknown. A
number of special devices can be employed to circumvent this difficulty,
but we shall not pursue the problem further because statisticians are
not yet agreed on what is the best procedure. Of course, for large
samples this criterion may be used. The root of the cubic can be
computed in any given instance by numerical methods, and $\lambda$ can then
be calculated. The quantity $-2 \log \lambda$ will have approximately the
chi-square distribution with one degree of freedom.
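One way the large-sample computation might be carried out is sketched below. Instead of solving the cubic, the sketch uses the fact that, for a fixed common mean $\mu$, the maximizing variances are $\frac{1}{m}\sum (x_{1i} - \mu)^2$ and $\frac{1}{n}\sum (x_{2j} - \mu)^2$, so that $-2 \log \lambda$ is the minimum over $\mu$ of $m \log [1 + m(\bar{x}_1 - \mu)^2 / \sum (x_{1i} - \bar{x}_1)^2] + n \log [1 + n(\bar{x}_2 - \mu)^2 / \sum (x_{2j} - \bar{x}_2)^2]$. The minimum is located by a plain grid search between the two sample means; this is a substitute numerical device chosen for simplicity, not the method of the text, and the function name, grid size, and data are assumptions.

```python
import math


def neg2_log_lambda_unequal_variances(x1, x2, grid_points=20001):
    """Approximate -2 log(lambda) for H0: mu1 = mu2 with unequal variances."""
    m, n = len(x1), len(x2)
    xb1, xb2 = sum(x1) / m, sum(x2) / n
    s1 = sum((x - xb1) ** 2 for x in x1)  # must be positive
    s2 = sum((x - xb2) ** 2 for x in x2)  # must be positive
    low, high = min(xb1, xb2), max(xb1, xb2)

    def criterion(mu):
        # Each term grows as mu moves away from its own sample mean, so the
        # minimizing mu lies between the two sample means.
        return (m * math.log(1 + m * (xb1 - mu) ** 2 / s1)
                + n * math.log(1 + n * (xb2 - mu) ** 2 / s2))

    # The criterion can have more than one local minimum (the cubic can have
    # three real roots), hence a grid search rather than a local method.
    return min(criterion(low + (high - low) * i / (grid_points - 1))
               for i in range(grid_points))


# Made-up samples: equal means, then widely separated means.
print(neg2_log_lambda_unequal_variances([1.0, 2.0, 3.0], [0.0, 2.0, 4.0]))
print(neg2_log_lambda_unequal_variances([1, 2, 3, 4], [11, 12, 13, 14]))
```

The result would be compared with the .05 level of chi square with one degree of freedom (3.841 in standard tables), bearing in mind that the approximation is claimed only for large samples.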

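When it can be assumed that the two populations have the same
variance, the problem becomes relatively simple. The parameter
space $\Omega$ is then three-dimensional with coordinates $(\mu_1, \mu_2, \sigma^2)$,
while $\omega$, for the null hypothesis $\mu_1 = \mu_2$, has two coordinates,
$\sigma^2$ and the common mean $\mu$. In $\Omega$ we find

$$\hat{\mu}_1 = \bar{x}_1 \qquad \hat{\mu}_2 = \bar{x}_2$$

$$\hat{\sigma}^2 = \frac{1}{m + n} \left[\sum_i (x_{1i} - \bar{x}_1)^2 + \sum_j (x_{2j} - \bar{x}_2)^2\right]$$

so that

$$L(\hat{\Omega}) = \left\{\frac{m + n}{2\pi \left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2\right]}\right\}^{(m+n)/2} e^{-(m+n)/2} \tag{3}$$

In $\omega$

$$\hat{\mu} = \frac{1}{m + n} \left(\sum_i x_{1i} + \sum_j x_{2j}\right) = \frac{m\bar{x}_1 + n\bar{x}_2}{m + n}$$

$$\hat{\sigma}^2 = \frac{1}{m + n} \left[\sum_i (x_{1i} - \hat{\mu})^2 + \sum_j (x_{2j} - \hat{\mu})^2\right]
= \frac{1}{m + n} \left[\sum_i (x_{1i} - \bar{x}_1)^2 + \sum_j (x_{2j} - \bar{x}_2)^2 + \frac{mn}{m + n} (\bar{x}_1 - \bar{x}_2)^2\right]$$

which gives

$$L(\hat{\omega}) = \left\{\frac{m + n}{2\pi \left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2 + \frac{mn}{m + n} (\bar{x}_1 - \bar{x}_2)^2\right]}\right\}^{(m+n)/2} e^{-(m+n)/2} \tag{4}$$

and finally

$$\lambda = \left\{\frac{1}{1 + \dfrac{[mn/(m + n)](\bar{x}_1 - \bar{x}_2)^2}{\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2}}\right\}^{(m+n)/2} \tag{5}$$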


This last expression is very similar to the corresponding one obtained
in the preceding section, and it turns out that this test can also be
performed in terms of a quantity which has the $t$ distribution. We
know that $\bar{x}_1$ and $\bar{x}_2$ are independently normally distributed with
means $\mu_1$ and $\mu_2$ and with variances $\sigma^2/m$ and $\sigma^2/n$. Referring to
Prob. 25 of Chap. 10, it is readily seen that $u = \bar{x}_1 - \bar{x}_2$ is normally
distributed with mean $\mu_1 - \mu_2$ and variance $\sigma^2[(1/m) + (1/n)]$.
Under the null hypothesis the mean of $u$ will be zero. The quantities
$\sum (x_{1i} - \bar{x}_1)^2/\sigma^2$ and $\sum (x_{2j} - \bar{x}_2)^2/\sigma^2$ are independently
distributed by chi-square laws with $m - 1$ and $n - 1$ degrees of freedom,
respectively; hence their sum, say $v$, has the chi-square distribution with
$m + n - 2$ degrees of freedom. Since under the null hypothesis

$$z = \frac{u}{\sigma \sqrt{(1/m) + (1/n)}}$$

is normally distributed with zero mean and unit variance, the quantity

$$t = \frac{z}{\sqrt{v/(m + n - 2)}} = \frac{\sqrt{mn/(m + n)}\,(\bar{x}_1 - \bar{x}_2)}{\sqrt{\left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2\right]/(m + n - 2)}}$$

has the $t$ distribution with $m + n - 2$ degrees of freedom. The
likelihood ratio is

$$\lambda = \left[\frac{1}{1 + t^2/(m + n - 2)}\right]^{(m+n)/2} \tag{6}$$

and its distribution is determined by the $t$ distribution. The test
would, of course, be done in terms of $t$ rather than $\lambda$. Possible 5 per
cent critical regions for $t$ are again $t < -t_{.10}$, $t > t_{.10}$, or
$t^2 > t_{.05}^2$, and the choice between these would depend on the problem at
hand. If, for example, the first population referred to the yield of a
variety of corn in common use, while the second referred to the yield of a
proposed substitute, the critical region would be $t < -t_{.10}$. If one were
comparing two proposed substitutes, the two-tailed test given by
$t^2 > t_{.05}^2$ would be used.
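The agreement between the two forms (5) and (6) of the likelihood ratio can be verified numerically. In this sketch the yields are invented figures and the function names are hypothetical; the critical value of $t$ would be read from a table for $m + n - 2$ degrees of freedom.

```python
import math


def two_sample_summary(x1, x2):
    """Return m, n, the two means, and the pooled sum of squares."""
    m, n = len(x1), len(x2)
    xb1, xb2 = sum(x1) / m, sum(x2) / n
    ss = sum((x - xb1) ** 2 for x in x1) + sum((x - xb2) ** 2 for x in x2)
    return m, n, xb1, xb2, ss


def pooled_t(x1, x2):
    """The t quantity of this section and its degrees of freedom."""
    m, n, xb1, xb2, ss = two_sample_summary(x1, x2)
    t = math.sqrt(m * n / (m + n)) * (xb1 - xb2) / math.sqrt(ss / (m + n - 2))
    return t, m + n - 2


def lambda_from_sums(x1, x2):
    """Equation (5)."""
    m, n, xb1, xb2, ss = two_sample_summary(x1, x2)
    ratio = (m * n / (m + n)) * (xb1 - xb2) ** 2 / ss
    return (1 / (1 + ratio)) ** ((m + n) / 2)


def lambda_from_t(t, m, n):
    """Equation (6)."""
    return (1 / (1 + t * t / (m + n - 2))) ** ((m + n) / 2)


# Made-up yields: a variety in common use and a proposed substitute.
standard = [62.1, 58.4, 60.3, 61.7, 59.5]
substitute = [64.2, 63.1, 65.8, 62.9, 66.0, 64.5]
t, df = pooled_t(standard, substitute)
print(t, df)
print(lambda_from_sums(standard, substitute),
      lambda_from_t(t, len(standard), len(substitute)))
```

A negative $t$ here points toward the substitute having the larger mean, which is the direction of the one-tailed region $t < -t_{.10}$ discussed above.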


We may observe here that it is possible to determine a confidence
interval for the difference $\mu_1 - \mu_2$ of the population means by using the
$t$ distribution. When the two means are different, the quantity

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma \sqrt{(1/m) + (1/n)}}$$

is normally distributed with zero mean and unit variance so that

$$t = \frac{z}{\sqrt{v/(m + n - 2)}}$$

has the $t$ distribution with $m + n - 2$ degrees of freedom. Since $t$
does not involve $\sigma^2$ but only the parameter $\theta = \mu_1 - \mu_2$, a confidence
interval for $\theta$ can be obtained. Upper and lower limits, for a 95 per
cent confidence interval, for example, would be obtained by solving
the equations

$$t = \pm t_{.05}$$

for $\theta$.
12.8. Tests on the Variance of a Normal Distribution. To test the
null hypothesis that the variance of a normal population has a specified
value $\sigma_0^2$ on the basis of a sample of size $n$, we first maximize

$$L = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2}\sum [(x_i - \mu)/\sigma]^2} \tag{1}$$

in $\Omega$, which has coordinates $(\mu, \sigma^2)$, and in $\omega$, which is the line
$\sigma^2 = \sigma_0^2$. The ratio of these maxima is readily found to be

$$\lambda = \left(\frac{u}{n}\right)^{n/2} e^{-\frac{1}{2}(u - n)} \tag{2}$$

where

$$u = \frac{\sum (x_i - \bar{x})^2}{\sigma_0^2} \tag{3}$$

Since $u$ is known to have the chi-square distribution with $n - 1$
degrees of freedom, the distribution of $\lambda$ could be found by transforming
the chi-square distribution by (2). The test may, however, be
done using $u$ as a criterion. On plotting equation (2) (Fig. 64), it is
seen that a critical region for $\lambda$ of the form $0 < \lambda < A$ corresponds
to the pair of intervals $0 < u < a$ and $b < u < \infty$ for $u$, where $a$ and
$b$ are such that the ordinates of (2) are equal.
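For a given $A$ the two limits $a$ and $b$ with equal ordinates in (2) can be found numerically, since $\lambda$ rises from zero to one as $u$ goes from 0 to $n$ and falls back toward zero beyond $n$. The sketch below uses simple bisection on $\log \lambda$; the function names, the starting brackets, and the example values $n = 10$, $A = .1$ are assumptions for illustration. Choosing $A$ so that the Type I error is exactly .05 would further require the chi-square distribution of $u$, which is not computed here.

```python
import math


def log_lambda(u, n):
    """Logarithm of equation (2): (n/2) log(u/n) - (u - n)/2."""
    return (n / 2) * math.log(u / n) - (u - n) / 2


def _bisect(f, low, high, iterations=200):
    """Root of f on [low, high], assuming f(low) and f(high) differ in sign."""
    f_low = f(low)
    for _ in range(iterations):
        mid = (low + high) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_low > 0):
            low, f_low = mid, f_mid
        else:
            high = mid
    return (low + high) / 2


def equal_ordinate_limits(n, A):
    """Return (a, b) with a < n < b and lambda(a) = lambda(b) = A, 0 < A < 1."""
    target = math.log(A)

    def f(u):
        return log_lambda(u, n) - target

    a = _bisect(f, 1e-12, n)      # lambda increases on (0, n)
    upper = 2.0 * n
    while f(upper) > 0:           # extend until lambda has dropped below A
        upper *= 2.0
    b = _bisect(f, n, upper)      # lambda decreases on (n, infinity)
    return a, b


a, b = equal_ordinate_limits(10, 0.1)
print(a, b)
```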


It can be shown that the power of this test will be slightly improved
if in the criterion (2), $n$ is replaced by $n - 1$, i.e., if

$$\mu = \left(\frac{u}{n - 1}\right)^{(n-1)/2} e^{-\frac{1}{2}(u - n + 1)} \tag{4}$$

is used as the test criterion. We shall not prove this statement; it is
an unimportant refinement unless $n$ is small. Using $\mu$, the critical
region for $u$ would be determined by numbers $a'$ and $b'$, say, such that
the ordinates of (4) were equal at those points. Since the chi-square
distribution is not tabulated in sufficient detail to determine these
numbers, it is common practice to use $\chi_{.975}^2 < u < \chi_{.025}^2$ as the
acceptance region rather than $a' < u < b'$, if, for example, the probability
of a Type I error is specified to be .05. Here again there are some
situations in which one of the one-tailed tests, $u < \chi_{.95}^2$ or
$u > \chi_{.05}^2$, would be preferred over the two-tailed test.

FIG. 64. [The criterion $\lambda$ of equation (2) plotted against $u$.]

Equality of Two Variances. Given samples from each of two normal
populations with means and variances $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, we
may test

$$H_0: \sigma_1^2 = \sigma_2^2$$

The likelihood ratio is found to be

$$\lambda = \frac{\left\{\dfrac{m + n}{2\pi \left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2\right]}\right\}^{(m+n)/2}}{\left[\dfrac{m}{2\pi \sum (x_{1i} - \bar{x}_1)^2}\right]^{m/2} \left[\dfrac{n}{2\pi \sum (x_{2j} - \bar{x}_2)^2}\right]^{n/2}} \tag{5}$$

where the notation is the same as that of the preceding section. This
criterion may be put in the form

$$\lambda = \frac{(m + n)^{(m+n)/2}}{m^{m/2}\, n^{n/2}} \cdot \frac{\left(\dfrac{m - 1}{n - 1} F\right)^{m/2}}{\left(1 + \dfrac{m - 1}{n - 1} F\right)^{(m+n)/2}} \tag{6}$$

where $F$ is the variance ratio:

$$F = \frac{(n - 1) \sum (x_{1i} - \bar{x}_1)^2}{(m - 1) \sum (x_{2j} - \bar{x}_2)^2} \tag{7}$$

which has the $F$ distribution with $m - 1$ and $n - 1$ degrees of freedom
when $H_0$ is true. On plotting $\lambda$ as a function of $F$, it is apparent that
the critical region $0 < \lambda < A$ corresponds to a two-tailed test on $F$.
It is customary to make the two tails have equal areas (though this is
not quite the best test) because the tabulations of $F$ make this region
easy to determine. Again one-tailed tests are often appropriate in
problems of this kind.
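That (6) is merely (5) rewritten in terms of $F$ can be checked with a short computation. The sketch works with $\log \lambda$ to avoid very small numbers; the two samples are invented and the function names are hypothetical.

```python
import math


def sums_of_squares(x1, x2):
    """Return m, n and the two sums of squared deviations."""
    m, n = len(x1), len(x2)
    xb1, xb2 = sum(x1) / m, sum(x2) / n
    s1 = sum((x - xb1) ** 2 for x in x1)
    s2 = sum((x - xb2) ** 2 for x in x2)
    return m, n, s1, s2


def variance_ratio(x1, x2):
    """Equation (7)."""
    m, n, s1, s2 = sums_of_squares(x1, x2)
    return ((n - 1) * s1) / ((m - 1) * s2)


def log_lambda_direct(x1, x2):
    """Logarithm of equation (5); the factors of 2*pi cancel."""
    m, n, s1, s2 = sums_of_squares(x1, x2)
    return (((m + n) / 2) * math.log((m + n) / (s1 + s2))
            - (m / 2) * math.log(m / s1)
            - (n / 2) * math.log(n / s2))


def log_lambda_from_f(f_ratio, m, n):
    """Logarithm of equation (6)."""
    c = (m - 1) * f_ratio / (n - 1)
    return (((m + n) / 2) * math.log(m + n)
            - (m / 2) * math.log(m) - (n / 2) * math.log(n)
            + (m / 2) * math.log(c)
            - ((m + n) / 2) * math.log(1 + c))


# Made-up samples with visibly different spreads.
first = [10.2, 9.8, 10.5, 9.9, 10.1, 10.4]
second = [12.0, 8.5, 11.2, 9.1, 10.6]
f_ratio = variance_ratio(first, second)
print(f_ratio)
print(log_lambda_direct(first, second),
      log_lambda_from_f(f_ratio, len(first), len(second)))
```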

Equality of Several Variances. A problem that frequently arises in
applied statistics is that of testing whether several normal populations
have the same variance. Let $x_{i1}, x_{i2}, \cdots, x_{in_i}$ be a sample of
size $n_i$ from a normal population with mean $\mu_i$ and variance $\sigma_i^2$, and
let there be one such sample from each of $k$ populations
$(i = 1, 2, \cdots, k)$. It is easily found that the likelihood ratio criterion
for testing

$$H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \cdots = \sigma_k^2$$

is

$$\lambda = \frac{\prod_{i=1}^{k} (S_i/n_i)^{n_i/2}}{\left(\sum S_i / \sum n_i\right)^{\sum n_i/2}} \tag{8}$$

where

$$S_i = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

Equation (8) is the direct generalization of (5). The distribution of
$\lambda$ is a complicated function, and from the applied point of view it is
of no use because it would not be feasible to tabulate the function
anyway. It contains $k$ parameters $n_1, n_2, \cdots, n_k$ and would have
to be tabulated for all possible combinations of values of these
parameters for every value of $k$. When the $n_i$ are large, the criterion does
provide a test because $-2 \log \lambda$ will then have approximately the
chi-square distribution with $k - 1$ degrees of freedom under $H_0$.
The number of degrees of freedom is $k - 1$ because $\Omega$ has $2k$
dimensions [the joint distribution of the $x_{ij}$ is specified when
$(\mu_1, \mu_2, \cdots, \mu_k, \sigma_1^2, \sigma_2^2, \cdots, \sigma_k^2)$ are specified],
while $\omega$ has $k + 1$ dimensions corresponding to the $k$ means and the
common variance.

It turns out that the test may be made even when the $n_i$ are not
large. The distribution of $-2 \log \lambda$ has been investigated and found
to be well approximated by the chi-square distribution with $k - 1$
degrees of freedom in any case. The approximation is even better
and the test somewhat improved if, instead of $-2 \log \lambda$, the criterion

$$\frac{-2 \log \mu}{1 + \dfrac{1}{3(k - 1)} \left[\sum \dfrac{1}{n_i - 1} - \dfrac{1}{\sum (n_i - 1)}\right]} \tag{9}$$

is employed, where $\mu$ represents the expression (8) with all the $n_i$
replaced by $n_i - 1$. The quantity $-2 \log \lambda$ gives a slightly biased
test, and $\mu$ has been defined so as to make the test unbiased. The
critical region for the test is, of course, the right-hand tail of the
chi-square distribution; a two-tailed test is never appropriate here.
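Criterion (9) is straightforward to compute from the sums of squares $S_i$. The following sketch assumes every sample has at least two observations and a positive sum of squares; the function name and the data are invented for the illustration, and the result is to be compared with a tabulated chi-square level for $k - 1$ degrees of freedom.

```python
import math


def modified_variance_criterion(samples):
    """Criterion (9) and its degrees of freedom, k - 1."""
    k = len(samples)
    dfs = [len(s) - 1 for s in samples]          # the n_i - 1
    sums = []
    for s in samples:
        mean = sum(s) / len(s)
        sums.append(sum((x - mean) ** 2 for x in s))   # the S_i
    total_df = sum(dfs)
    pooled = sum(sums) / total_df
    # -2 log(mu), where mu is (8) with each n_i replaced by n_i - 1
    neg2_log_mu = (total_df * math.log(pooled)
                   - sum(d * math.log(S / d) for S, d in zip(sums, dfs)))
    correction = 1 + (sum(1 / d for d in dfs) - 1 / total_df) / (3 * (k - 1))
    return neg2_log_mu / correction, k - 1


# Made-up samples: equal spreads, then clearly unequal spreads.
print(modified_variance_criterion([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
print(modified_variance_criterion([[1, 2, 3], [10, 20, 30], [5, 5.1, 4.9, 5]]))
```

When every sample has the same estimated variance $S_i/(n_i - 1)$ the criterion is zero, and it grows as the estimated variances spread apart, which is why only the right-hand tail serves as the critical region.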

12.9. The Goodness-of-fit Test. If a population has the multinomial
density

$$f(x; p) = \prod_{i=1}^{k} p_i^{x_i} \qquad x_i = 0, 1; \quad \sum x_i = 1; \quad \sum p_i = 1 \tag{1}$$

as would be the case in sampling with replacement from a population
of individuals which could be classified into $k$ classes, a common problem
is that of testing whether the probabilities have specified numerical
values. Thus the result of casting a die may be classified into one of
six classes. On the basis of a sample of observations, we may wish to
test whether the die is true, i.e., whether

$$p_i = \tfrac{1}{6} \qquad i = 1, 2, \cdots, 6$$


Let us suppose that $n$ observations are drawn from a population
with distribution (1) and that the number of observations that fall
in the $i$th class is $n_i$ ($\sum n_i = n$). The likelihood of the sample is

$$L = \prod_{i=1}^{k} p_i^{n_i} \tag{2}$$

and we shall test the null hypothesis

$$H_0: p_i = p_{0i}$$

where the $p_{0i}$ are given numbers. The parameter space $\Omega$ has $k - 1$
dimensions (given $k - 1$ of the $p_i$, the remaining one is determined by
$\sum p_i = 1$), while $\omega$ is a point. It is readily found that $L$ is maximized
in $\Omega$ when

$$\hat{p}_i = \frac{n_i}{n} \tag{3}$$

hence

$$L(\hat{\Omega}) = \frac{1}{n^n} \prod_{i=1}^{k} n_i^{n_i} \tag{4}$$

In $\omega$ the maximum value of $L$ is its only value

$$L(\hat{\omega}) = \prod_{i=1}^{k} p_{0i}^{n_i} \tag{5}$$

The likelihood ratio is

$$\lambda = n^n \prod_{i=1}^{k} \left(\frac{p_{0i}}{n_i}\right)^{n_i} \tag{6}$$

and the critical region is $0 < \lambda < A$, where $A$ is chosen to give the
desired probability of a Type I error. For small $n$, the distribution
of $\lambda$ may be tabulated directly in order to determine $A$; for large
values of $n$, we may use the fact that $-2 \log \lambda$ has approximately the
chi-square distribution with $k - 1$ degrees of freedom. The chi-square
approximation is surprisingly good even if $n$ is small provided
that $k > 2$.
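The direct tabulation mentioned for small $n$ can be sketched by listing every possible set of class frequencies, computing $\lambda$ from (6) and the multinomial probability of that set under $H_0$. The function names and the example ($k = 3$ equally likely classes, $n = 6$) are assumptions for illustration; the enumeration is practical only for small $n$ and $k$.

```python
import math
from itertools import product


def likelihood_ratio(counts, p0):
    """Equation (6); a class with n_i = 0 contributes a factor of one."""
    n = sum(counts)
    log_lam = n * math.log(n) + sum(
        c * math.log(p / c) for c, p in zip(counts, p0) if c > 0)
    return math.exp(log_lam)


def multinomial_probability(counts, p0):
    """Probability of the observed frequencies when the p_i equal the p_0i."""
    coefficient = math.factorial(sum(counts))
    for c in counts:
        coefficient //= math.factorial(c)
    return coefficient * math.prod(p ** c for p, c in zip(p0, counts))


def exact_distribution(n, p0):
    """Sorted list of (lambda, probability) over all frequency sets."""
    k = len(p0)
    return sorted(
        (likelihood_ratio(counts, p0), multinomial_probability(counts, p0))
        for counts in product(range(n + 1), repeat=k)
        if sum(counts) == n)


def type_one_error(n, p0, A):
    """Exact probability that lambda < A under the null hypothesis."""
    return sum(prob for lam, prob in exact_distribution(n, p0) if lam < A)


p0 = [1 / 3, 1 / 3, 1 / 3]
for A in (0.01, 0.05, 0.2):
    print(A, type_one_error(6, p0, A))
```

Because the distribution of $\lambda$ is discrete, only certain Type I error probabilities are attainable, and $A$ would be chosen to come as near the desired level as the tabulation allows.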

Another test commonly used for testing $H_0$ was proposed (by Karl
Pearson) before the general theory of testing hypotheses was developed.
This test criterion is

$$u = \sum_{i=1}^{k} \frac{(n_i - np_{0i})^2}{np_{0i}} \tag{7}$$

which in large samples has approximately the chi-square distribution
with $k - 1$ degrees of freedom when $H_0$ is true. The argument for
using (7) as a criterion is briefly this: The approximate large-sample
distribution of the $\hat{p}_i = n_i/n$ $(i = 1, 2, \cdots, k - 1)$ is normal and
is in fact

$$f(\hat{p}_1, \cdots, \hat{p}_{k-1}) = \left(\frac{n}{2\pi}\right)^{(k-1)/2} \frac{1}{\sqrt{p_1 p_2 \cdots p_k}} \exp \left[-\frac{n}{2} \sum_{i=1}^{k-1} \sum_{j=1}^{k-1} \left(\frac{\delta_{ij}}{p_i} + \frac{1}{p_k}\right) (\hat{p}_i - p_i)(\hat{p}_j - p_j)\right] \tag{8}$$

as follows from equation (10.9.18) on replacing $k$ by $k - 1$. We have
seen in Chap. 10 that the quadratic form of a multivariate normal
distribution in $k - 1$ variates has the chi-square distribution with $k - 1$
degrees of freedom; hence

$$n \sum_{i=1}^{k-1} \sum_{j=1}^{k-1} \left(\frac{\delta_{ij}}{p_i} + \frac{1}{p_k}\right) (\hat{p}_i - p_i)(\hat{p}_j - p_j) \tag{9}$$


has that distribution approximately for large samples. On summing
(9) with respect to $j$ and remembering that

$$\hat{p}_k = 1 - \sum_{1}^{k-1} \hat{p}_i$$

we find

$$n \sum_{i=1}^{k} \frac{(\hat{p}_i - p_i)^2}{p_i} \tag{10}$$

or

$$\sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i} \tag{11}$$

which is the same as (7) if the true values of the $p_i$ are $p_{0i}$. But let us
suppose the true $p_i$ are $p_{1i}$, at least some of which are different from
$p_{0i}$; then

$$\sum_{i=1}^{k} \frac{(n_i - np_{1i})^2}{np_{1i}} \tag{12}$$

has approximately the chi-square distribution with expected value
$k - 1$. The quantity

$$u = \sum_{i=1}^{k} \frac{(n_i - np_{0i})^2}{np_{0i}} \tag{13}$$

is easily shown to have an expected value

$$E(u) = \sum_{i=1}^{k} \frac{1}{np_{0i}} \left[np_{1i}(1 - p_{1i}) + n^2 (p_{1i} - p_{0i})^2\right] \tag{14}$$

which is certainly larger than $k - 1$ for sufficiently large $n$, and in fact
is larger than $k - 1$ for any $n$, because if $E(u)$ is minimized with respect
to the $p_{0i}$, it is found that the minimum occurs when $p_{0i} = p_{1i}$ and is
therefore $k - 1$. The argument for using $u$ as a test criterion is now
evident. If the true $p_i$ are $p_{0i}$, $u$ will have the chi-square distribution
approximately, while if the true $p_i$ are not $p_{0i}$, $u$ will be distributed
with a larger mean value, and that mean value becomes infinite as $n$
becomes large. Hence it is reasonable to test $H_0$ by using $u$ as a
criterion and the right-hand tail of the distribution as the critical
region.
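Expression (14) can be evaluated directly to watch the mean of $u$ move away from $k - 1$ as $n$ grows. In this sketch the function name and the alternative probabilities are invented for the illustration of a six-class (die) problem.

```python
def expected_u(n, p0, p1):
    """Equation (14): mean of u when the true probabilities are p1."""
    return sum((n * q * (1 - q) + n * n * (q - p) ** 2) / (n * p)
               for p, q in zip(p0, p1))


true_die = [1 / 6] * 6
loaded_die = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]   # hypothetical alternative

print(expected_u(100, true_die, true_die))     # null hypothesis true
for n in (100, 1000, 10000):
    print(n, expected_u(n, true_die, loaded_die))
```

With the hypothesized probabilities equal to the true ones the value is $k - 1 = 5$ whatever the sample size; under the alternative it increases roughly in proportion to $n$.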

We have discussed Pearson's chi-square criterion because of its
historical interest and because it is still commonly used to test $H_0$.
It is, in fact, equivalent to the likelihood-ratio test in large samples.
Perhaps the easiest way to show this is to write $\lambda$ in the form

$$\lambda = K\, \frac{n!}{\prod n_i!} \prod p_{0i}^{n_i}$$

where

$$K = \frac{n^n \prod n_i!}{n! \prod n_i^{n_i}}$$

If the variates of (8) are changed from $\hat{p}_i$ to $n_i$, the function will be
unchanged except for the change in factor $n^{(k-1)/2}$ since $n\, d\hat{p}_i = dn_i$.
It follows from Sec. 10.9 that $\lambda n^{k-1}/K$ approaches (8). By using
Stirling's formula (Sec. 2.3) for the factorials in $K$, it can be shown
that $K/n^{k-1}$ just cancels the coefficient of the exponential in (8) to
within terms of order $1/\sqrt{n}$; hence $-2 \log \lambda$ is asymptotically
equivalent to $u$.
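The closeness of the two criteria can be seen on a single set of frequencies. The sketch below computes Pearson's $u$ from (7) and $-2 \log \lambda$ from (6) for 120 hypothetical casts of a die; the counts and function names are invented for the illustration.

```python
import math


def pearson_u(counts, p0):
    """Equation (7)."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, p0))


def neg2_log_lambda(counts, p0):
    """From (6): -2 log(lambda) = 2 * sum of n_i log(n_i / (n p_0i))."""
    n = sum(counts)
    return 2 * sum(c * math.log(c / (n * p))
                   for c, p in zip(counts, p0) if c > 0)


# Made-up frequencies of the six faces in 120 casts.
counts = [18, 23, 16, 21, 18, 24]
true_die = [1 / 6] * 6
print(pearson_u(counts, true_die), neg2_log_lambda(counts, true_die))
```

Both values would be referred to the chi-square distribution with $k - 1 = 5$ degrees of freedom, and with frequencies of this size they lead to the same decision.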

12.10. Tests of Independence in Contingency Tables. A contingency
table is a multiple classification. Thus in a public-opinion survey
the individuals interviewed may be classified according to their
attitude on a political proposal and according to sex, to obtain a table of
the form:

      | Favor | Oppose | Undecided
Men   |       |        |
Women |       |        |

This is a 2 × 3 contingency table. The individuals are classified by
two criteria, one having two categories and the other three categories.
The six distinct classifications are called cells. A three-way
contingency table would have been obtained had the individuals been
further classified according to a third criterion, say according to annual
income group. If there were five income groups set up (such as: under
$1000, $1000 to $3000, ...), the contingency table would be called
a 2 × 3 × 5 table and would have 30 cells into which a person might
be put. It is often quite convenient to think of the cells as cubes in a
block two units wide, three units long, and five units deep. If the
individuals were still further classified into eight geographical locations,
one would have a four-way 2 × 3 × 5 × 8 contingency table with
240 cells in a four-dimensional block with edges 2, 3, 5, and 8 units
long. The contingency table provides a technique for investigating
suspected relationships. Thus one may suspect that men and women
will react differently to a certain political proposal, in which case he
would construct such a table as the one above and test the null
hypothesis that their attitudes were independent of their sex. To consider
another example, a geneticist may suspect that susceptibility to a 
certain disease is heritable. He would classify a sample of individuals 
according to (1) whether or not they ever had the disease, (2) whether 
or not their fathers had the disease, (3) whether or not their mothers 
had the disease. In the resulting 2 X 2 X 2 contingency table he 
would test the null hypothesis that classification (1) was independent 
of (2) and (3). Again a medical research worker might suspect a 
certain environmental condition favored a given disease and classify 
individuals according to (1) whether or not they ever had the disease, 
(2) whether or not they were subject to the condition. An industrial 
engineer would use a contingency table to discover whether or not 
two kinds of defects in a manufactured product were due to the same 
underlying cause or to different causes. It is apparent that the tech- 
nique can be a very useful tool in any field of research. 

Two-way Contingency Tables. We shall suppose that $n$ individuals
or items are classified according to two criteria $A$ and $B$, that there
are $r$ classifications $A_1, A_2, \cdots, A_r$ in $A$ and $s$ classifications
$B_1, B_2, \cdots, B_s$ in $B$, and that the number of individuals belonging to
$A_i$ and $B_j$ is $n_{ij}$. We have then an $r \times s$ contingency table with cell
frequencies $n_{ij}$ and $\sum n_{ij} = n$:

      | $B_1$    | $B_2$    | $B_3$    | ... | $B_s$
$A_1$ | $n_{11}$ | $n_{12}$ | $n_{13}$ | ... | $n_{1s}$
$A_2$ | $n_{21}$ | $n_{22}$ | $n_{23}$ | ... | $n_{2s}$
$A_3$ | $n_{31}$ | $n_{32}$ | $n_{33}$ | ... | $n_{3s}$      (1)
...   | ...      | ...      | ...      | ... | ...
$A_r$ | $n_{r1}$ | $n_{r2}$ | $n_{r3}$ | ... | $n_{rs}$

As a further notation we shall denote the row totals by $n_{i\cdot}$ and the
column totals by $n_{\cdot j}$:

$$n_{i\cdot} = \sum_j n_{ij} \qquad n_{\cdot j} = \sum_i n_{ij}$$

Of course

$$\sum_i n_{i\cdot} = \sum_j n_{\cdot j} = n$$


We shall now set up a probability model for the problem with which
we wish to deal. The $n$ individuals will be regarded as a sample of
size $n$ from a multinomial population with probabilities $p_{ij}$
$(i = 1, 2, \cdots, r;\ j = 1, 2, \cdots, s)$. The probability distribution for a
single observation is (Sec. 10.9)

$$f(x_{11}, x_{12}, \cdots, x_{rs}) = \prod_{i,j} p_{ij}^{x_{ij}} \qquad x_{ij} = 0, 1; \quad \sum_{i,j} x_{ij} = 1 \tag{2}$$

We wish to test the null hypothesis that the $A$ and $B$ classifications
are independent, i.e., that the probability an individual falls in $B_j$ is
not affected by the $A$ class to which the individual happens to belong.
Using the symbolism of Chap. 2, we would write

$$P(B_j \mid A_i) = P(B_j) \qquad P(A_i \mid B_j) = P(A_i)$$

or

$$P(A_i, B_j) = P(A_i)P(B_j)$$

If we denote the marginal probabilities $P(A_i)$ by $p_i$ $(i = 1, 2, \cdots, r)$
and the marginal probabilities $P(B_j)$ by $q_j$, the null hypothesis is
simply

$$H_0: p_{ij} = p_i q_j \qquad \sum p_i = 1, \quad \sum q_j = 1 \tag{3}$$

When the null hypothesis is not true, there is said to be interaction
between the two criteria of classification.

The complete parameter space $\Omega$ for the distribution (2) has $rs - 1$
dimensions (having specified all but one of the $p_{ij}$, the remaining one is
fixed by $\sum_{i,j} p_{ij} = 1$), while under $H_0$ we have a parameter space
$\omega$ with $r - 1 + s - 1$ dimensions. The likelihood for a sample of size
$n$ is

$$L = \prod_{i,j} p_{ij}^{n_{ij}} \tag{4}$$

and its maximum in $\Omega$ occurs when

$$\hat{p}_{ij} = \frac{n_{ij}}{n} \tag{5}$$

In $\omega$,

$$L = \prod_{i,j} (p_i q_j)^{n_{ij}} = \left(\prod_i p_i^{n_{i\cdot}}\right) \left(\prod_j q_j^{n_{\cdot j}}\right) \tag{6}$$

and its maximum occurs at

$$\hat{p}_i = \frac{n_{i\cdot}}{n} \qquad \hat{q}_j = \frac{n_{\cdot j}}{n} \tag{7}$$

The likelihood ratio is therefore

$$\lambda = \frac{\left(\prod_i n_{i\cdot}^{n_{i\cdot}}\right) \left(\prod_j n_{\cdot j}^{n_{\cdot j}}\right)}{n^n \prod_{i,j} n_{ij}^{n_{ij}}} \tag{8}$$

The distribution of $\lambda$ under the null hypothesis is not unique because
the hypothesis is composite and the exact distribution of $\lambda$ does involve
the unknown parameters $p_i$ and $q_j$. For large samples we do have a
test, however, because $-2 \log \lambda$ is, in that case, approximately
distributed by the chi-square law with

$$rs - 1 - (r + s - 2) = (r - 1)(s - 1)$$

degrees of freedom, and on the basis of this distribution a unique
critical region for $\lambda$ may be determined.
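The large-sample test can be sketched from equation (8): taking logarithms gives $-2 \log \lambda = 2 \left[\sum n_{ij} \log n_{ij} + n \log n - \sum n_{i\cdot} \log n_{i\cdot} - \sum n_{\cdot j} \log n_{\cdot j}\right]$. The function names and the two tables below are invented for the illustration, and the chi-square level would be read from a table for $(r - 1)(s - 1)$ degrees of freedom.

```python
import math


def _x_log_x(x):
    """x log x, with the convention that the value is zero at x = 0."""
    return x * math.log(x) if x > 0 else 0.0


def independence_criterion(table):
    """Return -2 log(lambda) from (8) and the degrees of freedom."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    log_lambda = (sum(_x_log_x(v) for v in row_totals)
                  + sum(_x_log_x(v) for v in column_totals)
                  - _x_log_x(n)
                  - sum(_x_log_x(v) for row in table for v in row))
    degrees = (len(row_totals) - 1) * (len(column_totals) - 1)
    return -2 * log_lambda, degrees


# Made-up 2 x 3 tables: proportional rows, then rows that clearly differ.
proportional = [[10, 20, 30], [20, 40, 60]]
differing = [[30, 10, 5], [10, 30, 5]]
print(independence_criterion(proportional))
print(independence_criterion(differing))
```

A table whose rows are exactly proportional gives a criterion of zero; the second table gives a value far beyond the .05 level of chi square with two degrees of freedom (5.991 in standard tables).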

In casting about for a test which may be used when the sample is 
not large, we may inquire how it is that a test criterion comes to have 
a unique distribution for large samples when the distribution actually 
depends on unknown parameters which may have any values in certain 
ranges. The answer is that the parameters are not really unknown; 
they can be estimated, and their estimates approach their true values 
as the sample size increases. In the limit as n becomes infinite the 
parameters are known exactly, and it is at that point that the distribution
of $\lambda$ actually becomes unique. It is unique because a particular point
in $\omega$ is selected as the true parameter point, so that the $n_{ij}$ are given a
unique distribution, and the distribution of $\lambda$ is then determined by
this distribution. 

It would appear reasonable to employ a similar procedure to set up 
a test for small samples, i.e., to define a distribution for $\lambda$ by using the
estimates for the unknown parameters. In the present problem, since
the estimates of the $p_i$ and $q_j$ are given by (7), we might just substitute
those values in the distribution function of the $n_{ij}$ and use that distribution
to obtain a distribution for $\lambda$. However we should still be in
trouble; the critical region would depend on the marginal totals $n_{i.}$
and $n_{.j}$; hence the probability of a Type I error would vary from sample
to sample for any fixed critical region $0 < \lambda < A$.

There is a way out of this difficulty which is well worth investigation
because of its own interest and because the problem is important in
applied statistics. Let us denote the joint density of all the $n_{ij}$ briefly
by $f(n_{ij})$, the marginal density of all the $n_{i.}$ and $n_{.j}$ by $g(n_{i.}, n_{.j})$, and the
conditional density of the $n_{ij}$, given the marginal totals, by

$$f(n_{ij} \mid n_{i.}, n_{.j}) = \frac{f(n_{ij})}{g(n_{i.}, n_{.j})}$$

Under the null hypothesis, this conditional distribution happens to be
independent of the unknown parameters (as we shall show presently);
the estimators $n_{i.}/n$ and $n_{.j}/n$ form a sufficient set of statistics for the
$p_i$ and $q_j$. This fact will enable us to construct a test.

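The joint density of the $n_{ij}$ is simply the multinomial

$$f(n_{11}, n_{12}, \ldots, n_{rs}) = \frac{n!}{\prod_{ij} n_{ij}!} \prod_{ij} p_{ij}^{n_{ij}} \tag{9}$$

in $\Omega$, and in $\omega$ (we are interested in the distribution of $\lambda$ under $H_0$) this
becomes

$$f(n_{11}, n_{12}, \ldots, n_{rs}) = \frac{n!}{\prod_{ij} n_{ij}!} \Bigl(\prod_i p_i^{n_{i.}}\Bigr)\Bigl(\prod_j q_j^{n_{.j}}\Bigr) \tag{10}$$
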
ij 
To obtain the desired conditional distribution, we must first find the
distribution of the $n_{i.}$ and $n_{.j}$, and this is accomplished by summing
(10) over all sets of $n_{ij}$ such that

$$\sum_i n_{ij} = n_{.j}, \qquad \sum_j n_{ij} = n_{i.} \tag{11}$$
i j 


For fixed marginal totals, only the factor $1/\prod n_{ij}!$ in (10) is involved
in the sum, so we have in effect to sum that factor over all $n_{ij}$ subject
to (11). The desired sum is given by comparing the coefficients of
$\prod_i x_i^{n_{i.}}$ in the expression

$$(x_1 + \cdots + x_r)^{n_{.1}} (x_1 + \cdots + x_r)^{n_{.2}} \cdots (x_1 + \cdots + x_r)^{n_{.s}} = (x_1 + \cdots + x_r)^n \tag{12}$$

On the right the coefficient of $\prod_i x_i^{n_{i.}}$ is simply

$$\frac{n!}{\prod_i n_{i.}!} \tag{13}$$




On the left there are terms in $\prod_i x_i^{n_{i.}}$ with coefficients of the form

$$\frac{n_{.1}!}{\prod_i n_{i1}!} \cdot \frac{n_{.2}!}{\prod_i n_{i2}!} \cdots \frac{n_{.s}!}{\prod_i n_{is}!} = \frac{\prod_j n_{.j}!}{\prod_{ij} n_{ij}!} \tag{14}$$

where $n_{ij}$ is the exponent of $x_i$ in the jth multinomial. In this expression
the $n_{ij}$ satisfy the conditions (11); the first condition is satisfied
in view of the multinomial theorem (Sec. 2.4), while the second is
satisfied because we require the power of $x_i$ in these terms to be $n_{i.}$.
The sum of all such coefficients (14) must equal (13); hence we may
write

$$\sum \frac{1}{\prod_{ij} n_{ij}!} = \frac{n!}{\prod_i n_{i.}! \prod_j n_{.j}!} \tag{15}$$
i i 


This is precisely the sum we require, because there is obviously one
and only one coefficient of the form of (14) on the left of (12) for every
possible contingency table (1) with given marginal totals. The distribution
of the $n_{i.}$ and $n_{.j}$ is, therefore,

$$g(n_{i.}, n_{.j}) = \frac{(n!)^2}{\prod_i n_{i.}! \prod_j n_{.j}!} \Bigl(\prod_i p_i^{n_{i.}}\Bigr)\Bigl(\prod_j q_j^{n_{.j}}\Bigr) \tag{16}$$

which shows incidentally that the $n_{i.}$ are distributed independently
of the $n_{.j}$; this is unexpected because $n_{1.}$ and $n_{.1}$, for example, have the
variate $n_{11}$ in common.

The conditional distribution of the $n_{ij}$, given the marginal totals, is
obtained by dividing (10) by (16) to obtain

$$f(n_{11}, n_{12}, \ldots, n_{rs} \mid n_{1.}, n_{2.}, \ldots, n_{.s}) = \frac{\bigl(\prod_i n_{i.}!\bigr)\bigl(\prod_j n_{.j}!\bigr)}{n! \prod_{ij} n_{ij}!} \tag{17}$$

which, happily, does not involve the unknown parameters and shows
that the estimators are sufficient.
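
As a numerical check on (17), one may enumerate every table having given marginal totals and confirm that the conditional probabilities add to one. The sketch below is illustrative and not from the original text; the helper names `conditional_prob` and `tables_2x2` and the marginal totals are assumptions chosen for the example, and the enumeration is restricted to 2 x 2 tables for brevity.

```python
from math import factorial, prod

def conditional_prob(table):
    """Probability (17) of an r x s table, given its marginal totals, under H0."""
    row = [sum(cells) for cells in table]      # n_i.
    col = [sum(c) for c in zip(*table)]        # n_.j
    n = sum(row)
    numerator = prod(factorial(k) for k in row) * prod(factorial(k) for k in col)
    denominator = factorial(n) * prod(factorial(k) for cells in table for k in cells)
    return numerator / denominator

def tables_2x2(row, col):
    """All 2 x 2 tables of non-negative counts with the given marginal totals."""
    found = []
    for a in range(min(row[0], col[0]) + 1):
        b, c = row[0] - a, col[0] - a
        d = row[1] - c
        if b >= 0 and c >= 0 and d >= 0:
            found.append([[a, b], [c, d]])
    return found

# Hypothetical margins: row totals 3 and 5, column totals 4 and 4.
total = sum(conditional_prob(t) for t in tables_2x2([3, 5], [4, 4]))
one_table = conditional_prob([[3, 0], [1, 4]])
```

No value of $p_i$ or $q_j$ enters the calculation, which is the point of the conditioning argument.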

To see how a test may be constructed, let us consider the general
situation in which a criterion $\lambda$ for some test has a distribution $u(\lambda; \theta)$
which involves an unknown parameter $\theta$. If $\theta$ has a sufficient estimator
$\hat\theta$, then the joint density of $\lambda$ and $\hat\theta$ may be written

$$v(\lambda, \hat\theta; \theta) = v_1(\lambda \mid \hat\theta)\, v_2(\hat\theta; \theta) \tag{18}$$

and the conditional density of $\lambda$ given $\hat\theta$ will not involve $\theta$. Using the
conditional distribution, we may find a number $A(\hat\theta)$ for every $\hat\theta$ such
that

$$\int_0^{A(\hat\theta)} v_1(\lambda \mid \hat\theta)\, d\lambda = .05$$


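for example. In the $\lambda$, $\hat\theta$ plane the curve $\lambda = A(\hat\theta)$ together with the
line $\lambda = 0$ will determine a region R. The probability that a sample
will give rise to a pair of values $(\lambda, \hat\theta)$ which correspond to a point in
R is exactly .05 because

$$\begin{aligned}
P[(\lambda, \hat\theta) \text{ in } R] &= \iint_R v(\lambda, \hat\theta; \theta)\, d\lambda\, d\hat\theta \\
&= \int_{-\infty}^{\infty} \Bigl[\int_0^{A(\hat\theta)} v_1(\lambda \mid \hat\theta)\, d\lambda\Bigr] v_2(\hat\theta; \theta)\, d\hat\theta \\
&= \int_{-\infty}^{\infty} .05\, v_2(\hat\theta; \theta)\, d\hat\theta \\
&= .05
\end{aligned} \tag{19}$$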

Hence we may test the hypothesis by using $\hat\theta$ in conjunction with $\lambda$.
The critical region is a plane region instead of an interval $0 < \lambda < A$;
it is such a region that whatever the unknown value of $\theta$ may be, the
Type I error has a specified probability. The test in any given situation
actually amounts to a conditional test; we observe $\hat\theta$ and test $\lambda$
by an interval $0 < \lambda < A(\hat\theta)$ using the conditional distribution of $\lambda$
given $\hat\theta$. It is to be observed that this device cannot be employed
unless $\theta$ has a sufficient estimator.

The above technique is obviously applicable when $\theta$ is a set of
parameters rather than a single parameter and has a set of sufficient
estimators $\hat\theta$. In particular the technique may be employed to test
the criterion (8) for the null hypothesis of a two-way contingency table.
One merely uses the conditional distribution (17) and determines an
interval $0 < \lambda < A(n_{i.}, n_{.j})$ which has the desired probability of a
Type I error for the observed marginal totals.
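
For a 2 x 2 table the conditional test just described can be carried out by direct enumeration. The following Python sketch is an illustration under stated assumptions and is not taken from the original text: the function names and the observed counts are hypothetical, and the quantity computed is the conditional probability, given the margins, that $\lambda$ is no larger than its observed value (equivalently, that $-2 \log \lambda$ is at least as large as observed). For two rows and two columns the distribution (17) reduces to the hypergeometric probabilities used here.

```python
import math
from math import comb

def xlogx(k):
    # k * log(k), with an empty cell contributing 0
    return k * math.log(k) if k > 0 else 0.0

def neg2_log_lambda(a, b, c, d):
    """-2 log(lambda) from (8) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = [a + b, c + d, a + c, b + d]
    log_lam = (sum(xlogx(m) for m in margins) - xlogx(n)
               - sum(xlogx(k) for k in (a, b, c, d)))
    return -2.0 * log_lam

def conditional_level(a, b, c, d):
    """P(lambda <= observed lambda | marginal totals) under H0."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    observed = neg2_log_lambda(a, b, c, d)
    level = 0.0
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        # probability (17) of the table whose upper-left cell is x
        prob = comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
        stat = neg2_log_lambda(x, r1 - x, c1 - x, r2 - c1 + x)
        if stat >= observed - 1e-9:      # small tolerance for rounding ties
            level += prob
    return level

level_extreme = conditional_level(3, 0, 1, 4)   # hypothetical observed table
level_central = conditional_level(1, 2, 3, 2)   # table close to independence
```

The hypothesis would be rejected at the .05 level when the returned probability does not exceed .05; because the conditional distribution is discrete, the attainable levels depend on the observed margins.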

In applications of this test, one is confronted with a very tedious 
computation in determining the distribution of $\lambda$ unless r, s, and the
marginal totals are quite small. It can be shown, however, that the 
large-sample approximation may be used without appreciable error 
except when both r and s equal two. In the latter instance, other 
simplifying approximations have been developed (see, for example, 
Fisher and Yates, “Tables for Statisticians and Biometricians,” 
Oliver & Boyd, Ltd., Edinburgh, 1938), but we shall not explore the 
problem that far. 

If the distribution (17) is replaced by its multivariate normal
approximation, it can be shown that the criterion

$$u = \sum_{ij} \frac{[n_{ij} - (n_{i.} n_{.j}/n)]^2}{n_{i.} n_{.j}/n} \tag{20}$$

has approximately the chi-square distribution with $(r - 1)(s - 1)$
degrees of freedom and is a reasonable criterion for testing $H_0$ of (3).
This is the criterion first proposed (by Karl Pearson) for testing the
hypothesis, and it differs from $-2 \log \lambda$ by terms of order $1/\sqrt{n}$.
The two criteria are therefore essentially equivalent unless n is small.
The argument that u is a reasonable criterion is entirely analogous to
that used to justify (7) in the preceding section.
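
The two criteria can be compared on a small example. The sketch below is illustrative only; the function names and counts are made up, and it uses the identity $-2 \log \lambda = 2 \sum_{ij} n_{ij} \log [n_{ij} / (n_{i.} n_{.j}/n)]$, which follows from (8) by collecting terms.

```python
import math

def expected_counts(table):
    """Estimated cell expectations n_i. n_.j / n under independence."""
    row = [sum(cells) for cells in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    return [[ri * cj / n for cj in col] for ri in row]

def pearson_u(table):
    """Criterion (20)."""
    e = expected_counts(table)
    return sum((nij - eij) ** 2 / eij
               for cells, e_row in zip(table, e)
               for nij, eij in zip(cells, e_row))

def neg2_log_lambda(table):
    """-2 log(lambda), written as 2 * sum n_ij log(n_ij / e_ij)."""
    e = expected_counts(table)
    return 2.0 * sum(nij * math.log(nij / eij)
                     for cells, e_row in zip(table, e)
                     for nij, eij in zip(cells, e_row) if nij > 0)

counts = [[20, 10], [10, 20]]     # hypothetical 2 x 2 table
u = pearson_u(counts)
g = neg2_log_lambda(counts)
```

For these counts the two values are close but not equal, as the text leads one to expect for a moderate sample.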

Three-way Contingency Tables. If the elements of a population can
be classified according to three criteria A, B, C with classifications
$A_i$ ($i = 1, 2, \ldots, s_1$), $B_j$ ($j = 1, 2, \ldots, s_2$), and $C_k$ ($k = 1, 2, \ldots, s_3$), a sample of n individuals may be classified in a three-way
$s_1 \times s_2 \times s_3$ contingency table. We shall let $p_{ijk}$ represent the probabilities
associated with the individual cells, $n_{ijk}$ be the numbers of
sample elements in the individual cells, and, as before, marginal totals
will be indicated by replacing the summed index by a dot; thus

$$n_{i.k} = \sum_{j=1}^{s_2} n_{ijk}, \qquad n_{..k} = \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} n_{ijk} \tag{21}$$


There are four hypotheses that may be tested in connection with
this table. We may test whether all three criteria are mutually
independent, in which case the null hypothesis is

$$p_{ijk} = p_i q_j r_k \tag{22}$$

or we may test whether any one of the three criteria is independent
of the other two. Thus to test whether the B classification was
independent of A and C, we would set up the null hypothesis

$$p_{ijk} = p_{ik} q_j \tag{23}$$


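The procedure for testing these hypotheses is entirely analogous to
that for the two-way tables. The likelihood of the sample is

$$L = \prod_{ijk} p_{ijk}^{n_{ijk}}, \qquad \sum_{ijk} p_{ijk} = 1, \quad \sum_{ijk} n_{ijk} = n \tag{24}$$

In $\Omega$ the maximum of L occurs when

$$\hat p_{ijk} = \frac{n_{ijk}}{n} \tag{25}$$

so that

$$L(\hat\Omega) = \frac{1}{n^n} \prod_{ijk} n_{ijk}^{n_{ijk}} \tag{26}$$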
Mie 


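To test (23), for example, we would make the substitution (23) in (24)
and maximize L with respect to the $p_{ik}$ and $q_j$ to find

$$\hat p_{ik} = \frac{n_{i.k}}{n}, \qquad \hat q_j = \frac{n_{.j.}}{n} \tag{27}$$

and

$$L(\hat\omega) = \frac{1}{n^{2n}} \Bigl(\prod_{ik} n_{i.k}^{n_{i.k}}\Bigr)\Bigl(\prod_j n_{.j.}^{n_{.j.}}\Bigr) \tag{28}$$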
ik J 


The likelihood ratio $\lambda$ is given by the quotient of (28) and (26), and in
large samples $-2 \log \lambda$ has the chi-square distribution with

$$s_1 s_2 s_3 - 1 - [(s_1 s_3 - 1) + s_2 - 1] = (s_1 s_3 - 1)(s_2 - 1)$$

degrees of freedom. Again the large-sample distribution is quite adequate
for all practical purposes unless the test has only one degree of
freedom.

12.11. Notes and References. It is now apparent that the sampling
distributions based on normal theory have an all-important role in 
statistical inference, both in estimation and in tests of hypotheses. 
We shall cite here the classic references. 

The chi-square distribution is due to Karl Pearson [1], who was the 
first major contributor to the theory of statistics. Pearson published 
nearly one hundred papers from about 1895 to 1935 which laid a firm 
foundation for modern statistics. He formulated the basic problems
and went far along the way to solving many of them. He is rightly
regarded as the founder of the science of statistical inference.


We have already mentioned that W. S. Gosset first showed the way 
to make an exact inference. Before his paper [2] was published, the 
accepted method of making inferences was to substitute estimates for 
parameters in population distributions. Gosset was the second major 
contributor to the field of statistics; he published about twenty papers 
in this field between 1908 and 1931. 

The F distribution was derived by R. A. Fisher [3], who also gave 
the first mathematical derivation of the t distribution [4]; Gosset had
obtained it by heuristic methods. Fisher is the real giant in develop- 
ment of the theory of statistics. His first paper was published in 
1912, and his work continues unabated today. Although hundreds of 
scholars have contributed to the science of statistics, this one man must 
be credited with at least half the essential and important developments 
as the theory now stands. 

The general theory of testing hypotheses, as we have presented it, 
is due to J. Neyman and E. S. Pearson (the son of Karl Pearson), 
who published the theory in an important series of joint papers beginning
in 1928 [5]. Many earlier workers, particularly Fisher, had
carried this problem far, but one crucial ingredient of the theory (the 
power of a test) was missing until Neyman and Pearson supplied it. 


1. Karl Pearson: “On a criterion that a given system of deviations 
from the probable in the case of a correlated system of variables 
is such that it can reasonably be supposed to have arisen in 
random sampling,” Philosophical Magazine, Vol. 50 (1900), p. 
157. 

2. "Student" (W. S. Gosset): “The probable error of a mean," 
Biometrika, Vol. 6 (1908), p. 1. 

3. R. A. Fisher: “The frequency distribution of the values of the cor- 
relation coefficient in samples from an indefinitely large popula- 
tion," Biometrika, Vol. 10 (1915), p. 507. 

4. R. A. Fisher: “Applications of ‘Student's’ distribution,” Metron, 
Vol. 5, No. 3 (1925), p. 90. 

5. J. Neyman and E. S. Pearson: “On the use and interpretation of cer- 
tain test criteria for purposes of statistical inference," Biometrika, 
Vol. 20A (1928), pp. 175 and 263. 


12.12. Problems 


1. Given the sample (—0.2, —0.9, —0.6, 0.1) from a normal popu- 
lation with unit variance, test whether the population mean is zero 


at the .05 level of significance (i.e., with probability .05 of a Type I
error). Test whether the mean is zero at the .05 level relative to
alternatives $\mu > 0$.

2. Given the sample (—4.4, 4.0, 2.0, —4.8) from a normal popula- 
tion with variance four and the sample (6.0, 1.0, 3.2, —0.4) from a 
normal population with variance five, test at the .01 level whether the 
means are equal relative to alternatives for which the mean of the 
first population is smaller than the mean of the second. 

3. A metallurgist made four determinations of the melting point 
of manganese: 1269, 1271, 1263, 1265 degrees centigrade. Are these 
in accord with the published value of 1260 at the .05 level? (Assume 
normality.) 

4. How would one make a two-sided test of $\mu = \mu_0$ for a normal
population with known variance? Is this a uniformly most powerful
test?

5. Plot the power function for two-sided tests of the null hypothesis
$\mu = 0$ for a normal distribution with known variance using
sample sizes 1, 4, 16, 64. (Use the standard deviation $\sigma$ as the unit of
measurement on the $\mu$ axis, and .05 probability of Type I error.)

6. What is the best critical region R in the sample space $(x_1, x_2, \ldots, x_n)$ for testing the null hypothesis that the mean is $\mu_0$ against
the alternative that the mean is $\mu_1$ for a normal population?

7. Referring to Prob. 6, what would be the region for testing
between two values of the variance, $\sigma_0^2$ and $\sigma_1^2$?

8. In testing between two values, $\mu_0$ and $\mu_1$, for the mean of a normal
population, show that the probabilities for both types of error can be
made arbitrarily small by taking a sufficiently large sample.

9. A cigarette manufacturer sent each of two laboratories pre- 
sumably identical samples of tobacco. Each made five determinations 
of the nicotine content in milligrams as follows: (a) 24, 27, 26, 21, 24 
and (b) 27, 28, 23, 31, 26. Were the two laboratories measuring the 
same thing? (Assume normality and a common variance.) 

10. The metallurgist of Prob. 3, after assessing the magnitude of 
the various errors that might accrue in his experimental technique, 
decided that his measurements should have a standard deviation of 
about 2 degrees. Are the data consistent with this supposition at the 
.05 level? (Use a one-sided test, $\sigma > 2$.)

11. Test the hypothesis that the two samples of Prob. 9 came from 
populations with the same variance at the .05 level. 

12. The power function for a test that the means of two normal
populations are equal depends on the values of the two means, $\mu_1$ and
$\mu_2$, and is therefore a surface. But the numerical value of the function
depends only on the difference $\theta = \mu_1 - \mu_2$, so that it can be adequately
represented by a curve, say $P(\theta)$. Plot $P(\theta)$ when samples of four
are drawn from one population with variance two, and samples of two
are drawn from another population with variance three for tests at
the .01 level.

13. Given the samples (1.8, 2.9, 1.4, 1.1), (5.0, 8.6, 9.2), (3.8, —4.1, 
0.8) from normal populations, test whether the variances are equal at 
the .05 level. 

14. Given a sample of size 100 with $\bar x = 2.7$ and $\sum (x_i - \bar x)^2 = 225$,
test the null hypothesis:

$$H_0: \; \mu = 3 \text{ and } \sigma = 2.5$$

at the .01 level assuming the population is normal.

15. Using the sample of Prob. 14, test the hypothesis that $\mu = \sigma^2$
at the .01 level.

16. Using the sample of Prob. 14, test at the .01 level whether the
95 per cent point $\alpha$ of the population distribution is three relative to
alternatives $\alpha < 3$. The 95 per cent point is the number $\alpha$ such that
$\int_{-\infty}^{\alpha} f(x)\, dx = .95$, where f(x) is the population density; it is, of course,
$\mu + 1.645\sigma$ in the present instance where the distribution is assumed
to be normal.

17. Verify equations (8.5) and (8.6). 

18. Verify equation (8.8). 

19. Given the sample of Prob. 14 together with a sample from a
second normal population of size 80 with $\bar x = 2.2$ and $\sum (x_i - \bar x)^2 = 320$, test whether the means are equal at the .05 level. (The required
root of the cubic equation encountered here is 2.56.)

20. In making two-sided tests of $\theta = \theta_0$, one does not ordinarily
merely reject $\theta_0$ when the test criterion falls in the critical region; he
usually states that $\theta < \theta_0$ or that $\theta > \theta_0$ depending on which is indicated
by the result of the test. In this situation there is a third error
possible: one may declare $\theta < \theta_0$ when in fact $\theta > \theta_0$, or vice versa.
Plot the probability of such a gross error as a function of $(\mu - \mu_0)/\sigma$
in the situation described in Prob. 4 for samples of size four and for
probability .05 of a Type I error.

21. A sample of size n is drawn from each of k normal populations 
with the same variance. Derive the likelihood-ratio criterion for test- 
ing the hypothesis that the means are all zero. Show that criterion 
is a function of a ratio which has the F distribution. 



22. Derive the likelihood-ratio criterion for testing whether the 
correlation of a bivariate normal distribution is zero. 

23. If $x_1, x_2, \ldots, x_n$ are observations from normal populations
with known variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, how would one test whether
their means were all equal?

24. A newspaper in a certain city observed that driving conditions 
were much improved in the city because the number of fatal automobile 
accidents in the past year was 9, whereas the average number per year 
over the past several years was 15. Is it possible that conditions were
about the same as before? Assume number of accidents in a given
year has a Poisson distribution.

25. Six 1-foot specimens of insulated wire were tested at high voltage 
for weak spots in the insulation. The numbers of such weak spots 
were found to be 2, 0, 1, 1, 3, 2. The manufacturer's quality standard
states that there are less than 120 such defects per 100 feet. Does the 
batch from which these specimens were taken conform to the standard 
at the .05 level of significance? (Use the Poisson distribution.) 

26. A psychiatrist newly employed by a medical clinic remarked at 
a staff meeting that about 40 per cent of all chronic headache sufferers
were of the psychosomatic variety. His disbelieving colleagues mixed
some pills of plain flour and water, giving them to all such patients
on the clinic's rolls with the story that they were a new headache
remedy and asking for comments. When the comments were all in, 
they could be fairly accurately classified as follows: (1) better than 
aspirin, 8; (2) about the same as aspirin, 3; (3) slower than aspirin, 1; 
(4) not worth the powder to blow them to hell, 29. While the doctors 
were somewhat surprised by these results, they nevertheless accused 
the psychiatrist of exaggeration. Did they have good grounds? 

27. Supply the details of the argument in the last paragraph of 
Sec. 9. 

28. A die was cast 300 times with the following results:

Occurrence     1    2    3    4    5    6
Frequency     43   49   56   45   66   41

Are the data consistent at the .05 level with the hypothesis that the die
is true?

29. Of 64 offspring of a certain cross between guinea pigs, 34 were 
red, 10 were black, 20 were white. According to the genetic model 
these numbers should be in the ratio 9:3:4. Are the data consistent 


with the model at the .05 level? 


30. A prominent baseball player’s batting average dropped from .313 
in one year to .280 in the following year. He was at bat 374 times 
during the first year and 268 times during the second. Is the hypoth- 
esis tenable at the .05 level that his hitting ability was the same during 
the two years? 

31. Find the mean and variance of $n_{ij}$ in the conditional distribution
(10.17).

32. Show that the expected value of u defined by (10.20) is
$n(r - 1)(s - 1)/(n - 1)$ under the conditional distribution (10.17).

33. Using the data of Prob. 30, assume that one has a sample of 374
from one binomial population and 268 from another. Derive the $\lambda$
criterion for testing whether the probability of a hit is the same for the
two populations. How does this test compare with the ordinary test
for a 2 X 2 contingency table?

34. The progeny of a certain mating were classified by a physical 
attribute into three groups, the numbers being 10, 53, 46. According 
to a genetic model the frequencies should be in the ratios $p^2 : 2p(1 - p) : (1 - p)^2$. Are the data consistent with the model at the .05 level?

35. A thousand individuals were classified according to sex and 
according to whether or not they were color-blind as follows: 


              Male   Female

Normal         442      514


According to the genetic model these numbers should have relative 
frequencies given by 


$p/2$        $p^2/2 + pq$

$q/2$        $q^2/2$


where $q = 1 - p$ is the proportion of defective genes in the population.
Are the data consistent with the model? 
36. Treating the table of Prob. 35 as a 2 X 2 contingency table, test
the hypothesis that color blindness is independent of sex. 
37. Gilby classified 1725 school children according to intelligence and 
apparent family economic level. A condensed classification follows: 


                     Dull   Intelligent   Very capable

Very well clothed      81       322           233
Well clothed          141       457           153
Poorly clothed        127       163            48


Test for independence at the .01 level. 

38. A serum supposed to have some effect in preventing colds was 
tested on 500 individuals, and their records for 1 year were compared 
with the records of 500 untreated individuals as follows: 


No colds | One cold | More than one cold 


252 103 


224 


Treated. -isis idoneae 
Untreated. s.e seesi 


Test at the .05 level whether the sets of probabilities for the two 
trinomial populations may be regarded as the same. 

39. Derive the general $\lambda$ criterion for testing for independence in
an r X s table when one set of marginal totals (the row totals, for
example) are fixed in advance as in Prob. 38. Each row is regarded as
a sample from an s-fold multinomial population with probabilities $p_{ij}$
such that $\sum_j p_{ij} = 1$ for all i. The hypothesis of independence becomes:
$p_{1j} = p_{2j} = p_{3j} = \cdots = p_{rj}$ for all j. How many degrees of freedom
does $-2 \log \lambda$ have?

40. According to the genetic model the proportion of individuals
having the four blood types should be related by:

O: $q^2$
A: $p^2 + 2pq$
B: $r^2 + 2qr$
AB: $2pr$

where $p + q + r = 1$. Given the sample: O, 374; A, 436; B, 132; AB,
58; how would you test the correctness of the model?

41. Given cell frequencies $n_{ijk}$ ($i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, s$;
$k = 1, 2, \ldots, t$) in a three-way classification, derive the criterion for
testing whether all three criteria of classification are independent.
How many degrees of freedom does $-2 \log \lambda$ have?


42. Galton investigated 78 families classifying children according to 
whether or not they were light-eyed, whether or not they had a light- 
eyed parent, whether or not they had a light-eyed grandparent. The 
following 2 X 2 X 2 table resulted: 


                          Grandparent
                   Light              Not
                            Parent
               Light     Not     Light     Not

Child  Light    1928     552      596      508
       Not       303     395      225      501

Test for complete independence at the .01 level. Test whether the 
child classification is independent of the other two classifications at the 
.01 level.

43. Derive the $\lambda$ criterion for testing whether the i classification is
independent of the jk classification in a three-way contingency table
when the marginal totals $n_{i..}$ are fixed in advance. The probabilities
satisfy the relations $\sum_{jk} p_{ijk} = 1$ for all i, and the null hypothesis is
$p_{1jk} = p_{2jk} = \cdots = p_{s_1 jk}$, or simply $p_{ijk} = p_{jk}$.
How many degrees of freedom does $-2 \log \lambda$ have?

44. Derive the test for complete independence in the situation
described in Prob. 43. The null hypothesis is $p_{ijk} = p_j q_k$. How many
degrees of freedom does $-2 \log \lambda$ have? How does this test compare
with that for the case in which the $n_{i..}$ are not fixed in advance?

45. Compute the exact distribution of $\lambda$ for a 2 X 2 contingency
table with marginal totals $n_{1.} = 4$; $n_{2.} = 7$; $n_{.1} = 6$; $n_{.2} = 5$. What
is the exact probability that $-2 \log \lambda$ exceeds 3.84, the .05 level
of chi square for one degree of freedom?




CHAPTER 13 
REGRESSION AND LINEAR HYPOTHESES 


13.1. Families of Populations. In this chapter we shall study
special cases of a situation which may be described as follows: A
family of populations has a set of variates (which may be symbolized
by x whether or not there is only one variate), a set of parameters $\theta$,
which are in general unknown, and a set of parameters z, which are
usually observable and known for a given sample. The parameters $\theta$
may or may not be functions of the parameters z. If they are functions
of z, the functions will in general be unknown. We shall consider the
problem of making inferences about the parameters $\theta$ on the basis of
samples drawn from populations with different values of z. The
family of density functions may be represented by

$$f(x; \theta, z)$$

We shall select populations with known values of z and draw samples
from each of these populations. Thus we shall deal with collections
of samples: $x_{1j}$ ($j = 1, 2, \ldots, n_1$) for $z = z_1$; $x_{2j}$ ($j = 1, 2, \ldots, n_2$)
for $z = z_2$; $\ldots$; $x_{mj}$ ($j = 1, 2, \ldots, n_m$) for $z = z_m$. We may, of
course, draw only one observation from each population, in which case
the observations could be represented by $(x_1, z_1), (x_2, z_2), \ldots, (x_m, z_m)$. On the basis of such collections of observations on x and z,
we may estimate certain of the parameters $\theta$ or test hypotheses about
the parameters $\theta$.

This general problem may be illustrated by considering the distribution
of heights of individuals. A person's height may be expected to
be related to his father's height z and his mother's height z'. Let us
assume that for parents with given heights, children's adult heights will
be normally distributed with means $\mu(z, z')$ and variances $\sigma^2$ independent
of the z's, i.e., that heights x have densities

$$f(x; \mu, \sigma^2, z, z') = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2\sigma^2)[x - \mu(z, z')]^2} \tag{1}$$

Here we have a variate x, a pair of parameters z, z' which can be
observed, and a pair of unknown parameters $\mu$ and $\sigma^2$, one of which is
regarded as a function of the observable parameters. By measuring
the heights of children and parents in several families with more than
one child, we may, for example, test the hypothesis that the functional
form of $\mu(z, z')$ is

$$\mu = a + bz + cz' \tag{2}$$


where the a, b, c are unknown constants. If this hypothesis is accept- 
able, we may further wish to estimate the unknown parameters a, b, c. 

To consider another example, the velocity of an object falling from
rest in air may be expected to depend on the length of time t it has
been falling, on its weight w, and on certain other parameters s specifying
its size and shape. Again the distribution of velocities might be
assumed to be normal with mean $\mu$ and variance $\sigma^2$, both of which may
be functions of the observable parameters t, w, s. On the basis of a
sample of observed velocities together with the corresponding values
of the observable parameters, one might, for example, test certain
hypotheses about the forms of the unknown functions $\mu(t, w, s)$ and
$\sigma^2(t, w, s)$.

These problems are regression problems. They are sometimes
referred to as prediction problems. Thus in the first example, after
the parameters a, b, c, and $\sigma^2$ are estimated, one may predict with about
95 per cent certainty that the children of a couple with given heights
$z_0$, $z_0'$ would have heights between

$$\hat a + \hat b z_0 + \hat c z_0' - 1.96\hat\sigma \quad \text{and} \quad \hat a + \hat b z_0 + \hat c z_0' + 1.96\hat\sigma$$

if the estimates were based on a large sample. The accuracy of a prediction
depends largely on the size of the prediction interval which in
the present instance depends on the error variance $\sigma^2$. In the case of
a falling body, the error variance is so small under certain conditions
that the velocity can be predicted almost exactly (the length of a
95 per cent prediction interval is small enough to be negligible for most
practical purposes). In the case of predicting heights of children, the
prediction interval would not be small relative to the mean $a + bz_0 + cz_0'$.

Regression problems occur in great variety in all sciences, both 
natural and social. In fact, from one point of view the whole aim of 
science in general is to predict (on the basis of past experimental work) 
what will happen in a given circumstance. 

We shall be concerned with a special case of the general regression 
problem which, however, has very wide application. We shall deal 
with normal distributions in which the mean is a function of the
observable parameters. The variance of the normal distribution will
be assumed to be independent of the observable parameters. The
mean $\mu(z)$, where z is the set of observable parameters, is called the
regression function; the function would represent a curve if z consisted
of one parameter, a surface if z consisted of two parameters, a hypersurface
for more than two parameters.

13.2. Simple Linear Normal Regression. A variate x is normally
distributed about a regression function which is linear in a single
observable parameter; the variance is independent of that parameter.
The density is

$$f(x; \alpha, \beta, \sigma^2, z) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2\sigma^2)[x - (\alpha + \beta z)]^2} \tag{1}$$

We shall deal with the one-parameter family of normal distributions
for which $\alpha$, $\beta$, $\sigma^2$ are fixed. The family is represented in Fig. 65; for
any given value of z, x is normally distributed with mean $\alpha + \beta z$
and variance $\sigma^2$.


[Fig. 65. Densities f(x) at $z = z_1, z_2, \ldots$ along the regression line $x = \alpha + \beta z$.]


We shall consider first the estimation of $\alpha$, $\beta$, $\sigma^2$. Let $(x_i, z_i)$,
$i = 1, 2, \ldots, n$, be a sample of x's together with the corresponding
values of z. Some of the $z_i$ may be equal, as would be the case if
more than one x value were drawn from any specific distribution. It
is convenient to label the z's differently even when some of them are
the same. It is necessary that there be at least two different values
of z however. Obviously one cannot expect to estimate $\alpha$ and $\beta$
from a sample drawn from a single member of the family of distributions.
The method of maximum likelihood will be employed to estimate
the parameters. The likelihood is


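$$L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2\sigma^2)[x_i - (\alpha + \beta z_i)]^2} \tag{2}$$

and its logarithm is

$$\log L = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_i [x_i - (\alpha + \beta z_i)]^2 \tag{3}$$

On putting the derivatives of this expression with respect to $\alpha$, $\beta$,
$\sigma^2$ equal to zero, we obtain the relations

$$n\sigma^2 = \sum (x_i - \alpha - \beta z_i)^2 \tag{4}$$

$$\sum (x_i - \alpha - \beta z_i) = 0 \tag{5}$$

$$\sum z_i (x_i - \alpha - \beta z_i) = 0 \tag{6}$$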


which must be solved for the unknown parameters. The last two
equations are called the normal equations which determine the coefficients
in a linear regression function. They are linear in $\alpha$ and $\beta$ and
therefore readily solved. We shall let

$$\bar x = \frac{1}{n} \sum x_i, \qquad \bar z = \frac{1}{n} \sum z_i \tag{7}$$

The solutions of (4), (5), and (6) may then be written

$$\hat\beta = \frac{\sum (x_i - \bar x)(z_i - \bar z)}{\sum (z_i - \bar z)^2} \tag{8}$$

$$\hat\alpha = \bar x - \hat\beta \bar z \tag{9}$$

$$\hat\sigma^2 = \frac{1}{n} \sum (x_i - \hat\alpha - \hat\beta z_i)^2 \tag{10}$$

which are the required point estimators of the unknown parameters.
We notice that the solution could not be carried through if all the $z_i$
were equal because the denominator of (8) would vanish.
Distribution of the Estimators. Since $\hat\alpha$ and $\hat\beta$ are linear functions of
the $x_i$ (which are normally distributed), it follows that $\hat\alpha$ and $\hat\beta$ must
themselves have a bivariate normal distribution. One could specify
that distribution by simply finding the means, variances, and covariance
of $\hat\alpha$ and $\hat\beta$. We shall, however, find the distribution another
way. The main objective is to show that $\hat\alpha$ and $\hat\beta$ are distributed
independently of $\hat\sigma^2$, and in doing this their distribution will fall out
incidentally.
We shall evaluate the joint-moment generating function: 


f= 


né? 
: 


NE 
m(81, $2, 83) = xe Fluss ce LE) (11) 


ZEE 


= n ,@-a, f=8 4,28 V ciapa) 
f (5j Tou Vn PY: Bzi) I] de: (12) 


2mc? 


for the three variates (å — a)/o, (Ê — B)/c, and nó*/c?. The first 
step in evaluating the integral is to transform the variates x; to 


—- 


yi = i (a; — e — Ba) (13) 


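this removes the factor $1/\sigma^n$ from the integrand and changes the exponent
in the integrand of (12) to

$$\sum_{i=1}^{n} t_i y_i - \frac{1}{2}\sum_{i,j=1}^{n} c_{ij} y_i y_j \tag{14}$$

where

$$t_i = \frac{s_1\left[(\sum z_j^2/n) - \bar{z} z_i\right] + s_2(z_i - \bar{z})}{\sum (z_j - \bar{z})^2} = a_i s_1 + b_i s_2 \tag{15}$$

and

$$c_{ij} = \delta_{ij}(1 - 2s_3) + 2s_3\left[n a_i a_j + n\bar{z}(a_i b_j + a_j b_i) + b_i b_j \sum z_k^2\right] \tag{16}$$

where the $a_i$ and $b_i$ are defined by (15) and $\delta_{ij}$ is one or zero according
as $i$ is or is not equal to $j$. We have then to evaluate an integral of
the form

$$\left(\frac{1}{2\pi}\right)^{n/2} \int \cdots \int e^{\sum t_i y_i - \frac{1}{2}\sum c_{ij} y_i y_j} \prod dy_i \tag{17}$$

which, apart from a factor $\sqrt{|c_{ij}|}$, is just the integral in equation (9.6.2)
with the $\mu$'s put equal to zero. The value of that integral is given in
(9.6.4), and it follows that

$$m(s_1, s_2, s_3) = \frac{e^{\frac{1}{2}\sum c^{ij} t_i t_j}}{\sqrt{|c_{ij}|}} \tag{18}$$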


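The algebraic reduction of (18) may be accomplished as follows:
Since

$$a_i + \bar{z} b_i = \frac{1}{n}$$

equation (16) may be written

$$c_{ij} = \delta_{ij}(1 - 2s_3) + 2s_3\left[b_i b_j \sum (z_k - \bar{z})^2 + \frac{1}{n}\right] \tag{19}$$

or

$$c_{ij} = \delta_{ij}(1 - 2s_3) + 2s_3\left(d_i d_j + \frac{1}{n}\right) \tag{20}$$

where

$$d_i = \frac{z_i - \bar{z}}{\sqrt{\sum (z_k - \bar{z})^2}} \tag{21}$$

so that $\sum d_i = 0$ and $\sum d_i^2 = 1$. It is not difficult to verify then that

$$|c_{ij}| = (1 - 2s_3)^{n-2} \tag{22}$$

and that the elements of the inverse of the matrix $\|c_{ij}\|$ are

$$c^{ij} = \frac{\delta_{ij}}{1 - 2s_3} - \frac{2s_3}{1 - 2s_3}\left(d_i d_j + \frac{1}{n}\right) \tag{23}$$

These last two relations enable one to put (18) in the form

$$m(s_1, s_2, s_3) = (1 - 2s_3)^{-(n-2)/2} \exp\left\{\frac{1}{2\sum (z_i - \bar{z})^2}\left[\frac{\sum z_i^2}{n} s_1^2 - 2\bar{z} s_1 s_2 + s_2^2\right]\right\} \tag{24}$$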


The form of the moment generating function (24) enables one to
draw several important conclusions. Remembering that $s_1$ is associated
with $(\hat\alpha - \alpha)/\sigma$, $s_2$ with $(\hat\beta - \beta)/\sigma$, $s_3$ with $n\hat\sigma^2/\sigma^2$, we observe

1. That the pair of variates $\hat\alpha$ and $\hat\beta$ are distributed independently of
$\hat\sigma^2$ because $m(s_1, s_2, s_3)$ factors into a function of $s_3$ alone and a function
of $s_1$ and $s_2$ alone (see Sec. 10.4). We shall let

$$m(s_1, s_2, s_3) = m_1(s_1, s_2)\, m_2(s_3) \tag{25}$$

2. That the functional form of $m_1(s_1, s_2)$ is that of the moment
generating function for a bivariate normal distribution (Sec. 9.6);
hence $\hat\alpha$ and $\hat\beta$ are jointly normally distributed with means $\alpha$ and $\beta$,
respectively, and variances and covariances

$$\sigma_{\hat\alpha}^2 = \frac{\sigma^2 \sum z_i^2}{n \sum (z_i - \bar{z})^2} \tag{26}$$
$$\sigma_{\hat\beta}^2 = \frac{\sigma^2}{\sum (z_i - \bar{z})^2} \tag{27}$$
$$\operatorname{cov}(\hat\alpha, \hat\beta) = -\frac{\sigma^2 \bar{z}}{\sum (z_i - \bar{z})^2} \tag{28}$$

The inverse of the matrix of these variances and covariances is

$$\begin{Vmatrix} n/\sigma^2 & n\bar{z}/\sigma^2 \\ n\bar{z}/\sigma^2 & \sum z_i^2/\sigma^2 \end{Vmatrix} \tag{29}$$

which are the coefficients of the quadratic form in the distribution of
$(\hat\alpha - \alpha)$ and $(\hat\beta - \beta)$.

3. That $\hat\alpha$ and $\hat\beta$ will be independently distributed if the $z_i$ are chosen
so that $\bar{z} = 0$.

4. That the quadratic form of the joint distribution of $\hat\alpha$ and $\hat\beta$,

$$Q = \frac{1}{\sigma^2}\left[n(\hat\alpha - \alpha)^2 + 2n\bar{z}(\hat\alpha - \alpha)(\hat\beta - \beta) + \sum z_i^2 (\hat\beta - \beta)^2\right] \tag{30}$$

has the chi-square distribution with two degrees of freedom.

5. That $m_2(s_3)$ is the moment generating function for a chi-square
distribution with $n - 2$ degrees of freedom, hence that $n\hat\sigma^2/\sigma^2$ has that
distribution (Sec. 10.3).
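
The variances and covariance (26), (27), (28) and the inverse matrix (29) can be checked numerically. The sketch below is only an illustration; the function names and the values of $z_i$ and $\sigma^2$ are assumptions chosen for the example.

```python
def estimator_covariance(z, sigma2):
    """Return (var_alpha, var_beta, cov) from equations (26), (27), (28)."""
    n = len(z)
    z_bar = sum(z) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    sum_z2 = sum(zi * zi for zi in z)
    var_alpha = sigma2 * sum_z2 / (n * s_zz)   # equation (26)
    var_beta = sigma2 / s_zz                   # equation (27)
    cov = -sigma2 * z_bar / s_zz               # equation (28)
    return var_alpha, var_beta, cov


def inverse_2x2(a, b, c, d):
    """Inverse of the matrix [[a, b], [c, d]] as a nested list."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical design: z = 0, 1, 2, 3 with sigma^2 = 1.
z = [0.0, 1.0, 2.0, 3.0]
va, vb, cab = estimator_covariance(z, 1.0)
inv = inverse_2x2(va, cab, cab, vb)
# By (29) the inverse should be [[n, n*z_bar], [n*z_bar, sum z^2]] / sigma^2.
```

For this design the inverse comes out as the matrix with rows $(4, 6)$ and $(6, 14)$, which agrees with (29) since $n = 4$, $n\bar{z} = 6$, and $\sum z_i^2 = 14$.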

Confidence Regions and Tests of Hypothesis. In regression problems
the main interest is usually in the regression coefficients $\alpha$ and $\beta$.
Of course there is no trouble in estimating $\sigma^2$ or in testing hypotheses
about $\sigma^2$, because the chi-square distribution of item 5 above provides
confidence intervals and tests directly.

To obtain a confidence interval for $\alpha$, we need only to observe that
the marginal distribution of $\hat\alpha$ is normal with mean $\alpha$ and variance
given by (26); hence

$$u = \frac{\hat\alpha - \alpha}{\sigma_{\hat\alpha}}$$

has a normal distribution with zero mean and unit variance. Since $u$
and $n\hat\sigma^2/\sigma^2$ are independently distributed, it follows from Sec. 10.6 that

$$t = \frac{u}{\sqrt{n\hat\sigma^2/(n-2)\sigma^2}} = (\hat\alpha - \alpha)\sqrt{\frac{n(n-2)\sum (z_i - \bar{z})^2}{\sum z_i^2 \sum (x_i - \hat\alpha - \hat\beta z_i)^2}} \tag{31}$$

has the $t$ distribution with $n - 2$ degrees of freedom. Since $\alpha$ is the
only unknown quantity in this expression, the inequalities in

$$P(-t_\epsilon < t < t_\epsilon) = 1 - \epsilon$$

may be converted to obtain a confidence interval with fiducial probability
$1 - \epsilon$ for $\alpha$. The quantity $t$ also provides a test criterion for
testing hypotheses about $\alpha$ in just the same way it does for the mean
of a normal distribution (Sec. 12.6). Thus to test whether the regression
line $x = \alpha + \beta z$ passes through the origin in the $x$, $z$ plane, we
should simply put $\alpha = 0$ in (31) and observe whether $|t| < t_\epsilon$ if the
level of significance is to be $\epsilon$. One-tailed tests may also be made.

Confidence intervals for $\beta$ and tests on $\beta$ may be made in a quite
similar way. It is readily seen that

$$t = (\hat\beta - \beta)\sqrt{\frac{(n-2)\sum (z_i - \bar{z})^2}{\sum (x_i - \hat\alpha - \hat\beta z_i)^2}} \tag{32}$$

also has the $t$ distribution with $n - 2$ degrees of freedom and involves
only the unknown parameter $\beta$. To test, for example, whether the
means of the family of normal distributions under consideration were
independent of the observable parameter, one would put $\beta = 0$ in
(32) and observe whether $|t| < t_\epsilon$, where $\epsilon$ is the chosen significance
level.
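
The test criteria (31) and (32) may be computed as in the following sketch. It is an illustration under stated assumptions: the function name, its arguments `alpha0` and `beta0` (the hypothesized values), and the data are hypothetical, and the critical value $t_\epsilon$ must still be read from a table of the $t$ distribution with $n - 2$ degrees of freedom.

```python
import math


def regression_t_statistics(x, z, alpha0=0.0, beta0=0.0):
    """Return (t_alpha, t_beta) from equations (31) and (32).

    alpha0 and beta0 are the hypothesized values of alpha and beta.
    Requires n > 2 and a nonzero residual sum of squares.
    """
    n = len(x)
    if n <= 2:
        raise ValueError("need more than two observations")
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    sum_z2 = sum(zi * zi for zi in z)
    beta_hat = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / s_zz
    alpha_hat = x_bar - beta_hat * z_bar
    # Residual sum of squares, equal to n * sigma2_hat.
    rss = sum((xi - alpha_hat - beta_hat * zi) ** 2 for xi, zi in zip(x, z))
    if rss == 0:
        raise ValueError("residual sum of squares is zero")
    t_alpha = (alpha_hat - alpha0) * math.sqrt(
        n * (n - 2) * s_zz / (sum_z2 * rss)
    )                                                   # equation (31)
    t_beta = (beta_hat - beta0) * math.sqrt((n - 2) * s_zz / rss)  # (32)
    return t_alpha, t_beta


# Hypothetical observations; test alpha = 0 and beta = 0.
t_alpha, t_beta = regression_t_statistics(
    [1.1, 2.9, 5.2, 6.8], [0.0, 1.0, 2.0, 3.0]
)
```

Each statistic would then be compared with $t_\epsilon$ for $n - 2 = 2$ degrees of freedom in this example.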
For simultaneous estimation of $\alpha$ and $\beta$, we may use the fact that

$$F = \frac{n-2}{2}\,\frac{Q}{n\hat\sigma^2/\sigma^2}$$

where $Q$ is defined by (30), has the $F$ distribution with 2 and $n - 2$
degrees of freedom (Sec. 10.5), and involves only the unknown
parameters $\alpha$ and $\beta$. The inequality in

$$P(F < F_\epsilon) = 1 - \epsilon \tag{33}$$

is readily seen to define an elliptical confidence region in the $\alpha$, $\beta$ plane
for $\alpha$ and $\beta$. To test whether $\alpha$ and $\beta$ had certain specified values $\alpha_0$
and $\beta_0$, one would put $\alpha = \alpha_0$ and $\beta = \beta_0$ in (33) and observe whether
or not the resulting value of $F$ exceeded $F_\epsilon$.

All these tests on $\alpha$ and $\beta$ could have been obtained by the
likelihood-ratio method.

It is worth observing that the accuracy of the estimation of $\alpha$ and $\beta$
depends on the choice of the $z_i$. Thus the variance of $\hat\alpha$ will be as
small as possible when the $z_i$ are chosen so that $\bar{z} = 0$. For, since

$$\sum (z_i - \bar{z})^2 = \sum z_i^2 - n\bar{z}^2$$

the least possible value for $\sigma_{\hat\alpha}^2$ (equation 26) is $\sigma^2/n$ and occurs when
$\bar{z} = 0$. Evidently the confidence interval for $\alpha$ will be shortest on the
average for given $n$ when $\bar{z} = 0$. The variance of $\hat\beta$ (equation 27)
can evidently be made small by choosing widely separated values for
the $z_i$. In fact, if $z'$ is the smallest practicable value of $z$ and $z''$ is
the largest, then $\beta$ will be best estimated when all the sampling is
done at those two values of $z$. It often happens in practice, however,
that there is some doubt about the linearity of the regression function
$\alpha + \beta z$, and it is desired to test for linearity. In this case it is necessary
to have observations for more than two values of $z$. A test for
linearity will be described in Sec. 14.2.

13.3. Prediction. Let us suppose that a linear regression function
$x = \alpha + \beta z$ has been estimated by $x = \hat\alpha + \hat\beta z$ on the basis of a
sample of $n$ observations. We now wish to predict the value of $x$ for
some specified value of $z$, say $z_0$. Thus if $x$ is son's adult height and $z$
is father's height, a sample of observations will provide estimates $\hat\alpha$
and $\hat\beta$ for a linear regression function. A prospective father of height
$z_0$ may wish to predict his son's height. The predicted height is, of
course, $\hat{x}_0 = \hat\alpha + \hat\beta z_0$. Or to consider a different problem: Let $x$ be
the demand for some commodity, and let $z$ be the wholesale price of
the item two months earlier, or the wholesale price of some ingredient
or part of the item two months earlier. It is desired to predict the
demand two months in advance of the present. From past records
one may collect a set of pairs of observations $(x_i, z_i)$, where $x_i$ is the
demand at a given time and $z_i$ is the wholesale price two months previous
to that time, and estimate coefficients $\alpha$ and $\beta$ of a linear regression.
If $z_0$ is the present wholesale price, then the predicted demand
two months hence is $\hat{x}_0 = \hat\alpha + \hat\beta z_0$.

The worth of a prediction depends on the magnitude of its possible
error, and we shall take account of that error by obtaining a prediction
interval which is analogous to a confidence interval. The variate $x$ is
a random variable with a normal distribution having mean $\alpha + \beta z_0$
and variance $\sigma^2$. The predicted value $\hat{x}_0 = \hat\alpha + \hat\beta z_0$ has two sources
of error: in the first place $\hat\alpha + \hat\beta z_0$ is merely an estimate of the mean
of $x$, and the actual value of $x$ may, of course, deviate from its mean;
in the second place the estimated mean is subject to the random
sampling errors inherent in $\hat\alpha$ and $\hat\beta$. If $\alpha$, $\beta$, and $\sigma$ were exactly known,
then a 95 per cent prediction interval for $x$ would simply be

$$\alpha + \beta z_0 - 1.96\sigma \quad \text{to} \quad \alpha + \beta z_0 + 1.96\sigma$$

since the probability that $x$ will fall within $1.96\sigma$ of its mean is .95 for
a normal distribution. Since all these parameters except $z_0$ are
unknown, we must attempt to set up an interval in terms of their
estimates.

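The variate

$$u = x - \hat\alpha - \hat\beta z_0 \tag{1}$$

is necessarily normally distributed since it is a linear function of the
normal variates $x$, $\hat\alpha$, $\hat\beta$. The distribution of $u$ is therefore known
when its mean and variance are given. Since

$$E(x) = \alpha + \beta z_0 \qquad E(\hat\alpha) = \alpha \qquad E(\hat\beta) = \beta$$

we have

$$E(u) = 0$$

The variance of $u$ is therefore

$$\sigma_u^2 = E(u^2)$$
$$= E[(x - \alpha - \beta z_0) - (\hat\alpha - \alpha) - z_0(\hat\beta - \beta)]^2$$
$$= \sigma_x^2 + \sigma_{\hat\alpha}^2 + z_0^2\sigma_{\hat\beta}^2 + 2z_0 E[(\hat\alpha - \alpha)(\hat\beta - \beta)] \tag{2}$$

remembering that $x$ is independent of $\hat\alpha$ and $\hat\beta$. $\sigma_x^2$ is simply $\sigma^2$, the
variance of the normal distribution, and the other terms in (2) are
given by (2.26), (2.27), (2.28), so that

$$\sigma_u^2 = \sigma^2\left[1 + \frac{\sum z_i^2}{n\sum (z_i - \bar{z})^2} + \frac{z_0^2}{\sum (z_i - \bar{z})^2} - \frac{2z_0\bar{z}}{\sum (z_i - \bar{z})^2}\right]$$
$$= \sigma^2\left[\frac{n+1}{n} + \frac{(z_0 - \bar{z})^2}{\sum (z_i - \bar{z})^2}\right] \tag{3}$$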


A 95 per cent prediction interval for $u$ is just $-1.96\sigma_u$ to $1.96\sigma_u$, but
this still involves one unknown parameter $\sigma$ which appears in $\sigma_u$.
We can eliminate $\sigma$ by using the $t$ distribution. The variate $u/\sigma_u$ is
normally distributed with zero mean and unit variance and is distributed
independently of $n\hat\sigma^2/\sigma^2$; hence

$$t = \frac{u/\sigma_u}{\sqrt{n\hat\sigma^2/(n-2)\sigma^2}} \tag{4}$$

has the $t$ distribution with $n - 2$ degrees of freedom and involves no
unknown parameters. The inequalities in

$$P(-t_\epsilon < t < t_\epsilon) = 1 - \epsilon$$

may be converted to determine a $100(1 - \epsilon)$ per cent prediction interval
for $x$. The interval is given by

$$P(\hat\alpha + \hat\beta z_0 - A < x < \hat\alpha + \hat\beta z_0 + A) = 1 - \epsilon \tag{5}$$

where

$$A^2 = t_\epsilon^2\,\frac{n\hat\sigma^2}{n-2}\left[\frac{n+1}{n} + \frac{(z_0 - \bar{z})^2}{\sum (z_i - \bar{z})^2}\right] \tag{6}$$
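
The half-width $A$ of the prediction interval in (5) and (6) may be computed as in the sketch below. This is an illustration only: the function name is an assumption, and the critical value `t_eps` ($t_\epsilon$ for $n - 2$ degrees of freedom) is supplied by the caller from a table rather than computed.

```python
import math


def prediction_half_width(z, sigma2_hat, z0, t_eps):
    """Half-width A of the prediction interval, from equation (6).

    z          : observable parameters used to fit the regression
    sigma2_hat : maximum-likelihood estimate of sigma^2 (divisor n)
    z0         : value of z at which x is to be predicted
    t_eps      : critical t value for n - 2 degrees of freedom (from a table)
    """
    n = len(z)
    if n <= 2:
        raise ValueError("need more than two observations")
    z_bar = sum(z) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    bracket = (n + 1) / n + (z0 - z_bar) ** 2 / s_zz
    return t_eps * math.sqrt(n * sigma2_hat / (n - 2) * bracket)


# Hypothetical values: the interval widens as z0 moves away from z_bar.
z = [0.0, 1.0, 2.0, 3.0]
near = prediction_half_width(z, 0.5, 1.5, 2.0)   # z0 equal to z_bar
far = prediction_half_width(z, 0.5, 3.0, 2.0)    # z0 at the edge of the data
```

The interval for $x$ is then $\hat\alpha + \hat\beta z_0 \pm A$; in this example `far` exceeds `near` because of the term in $(z_0 - \bar{z})^2$.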


Several properties of the prediction interval should be observed: 

1. The length of the interval is greater than $2t_\epsilon\sigma$ on the average
regardless of how large a sample was used to estimate $\alpha$ and $\beta$. This
is entirely reasonable because we are predicting a single observation $x$
which is normally distributed with standard deviation $\sigma$.

2. The average length of the prediction interval increases as $z_0$
moves away from $\bar{z}$. If it is possible, the values $z_i$ chosen for obtaining
observations to estimate the parameters should be selected so as to
have a mean value near $z_0$.

3. The relation (5) holds only for a single prediction based on the
estimates $\hat\alpha$, $\hat\beta$, $\hat\sigma$. One cannot use the estimated regression to make
several predictions and expect (5) to remain true. The relation has
meaning only if $\alpha$, $\beta$, $\sigma$ are reestimated each time a prediction on $x$ is
made. The probability statement takes account of sampling variation
in the estimates as well as in $x$, and if the original estimates are used
repeatedly (not allowed to vary), the statement cannot be effective.

It is easy to generalize the above technique to take account of prediction
of the mean of a sample of size $m$ observed for $z = z_0$. Let
$x_1', x_2', \cdots, x_m'$ be a sample of $m$ observations at $z_0$ with mean $\bar{x}'$.
The mean of

$$v = \bar{x}' - \hat\alpha - \hat\beta z_0$$

is zero, and its variance $\sigma_v^2$ is the same as (3) except that $(n + 1)/n$ is
replaced by $(1/m) + (1/n)$. The variate

$$t = \frac{v/\sigma_v}{\sqrt{n\hat\sigma^2/(n-2)\sigma^2}}$$

has the $t$ distribution with $n - 2$ degrees of freedom and involves no
unknown parameters; hence it may be employed to construct a prediction
interval for $\bar{x}'$.

13.4. Discrimination. The discrimination problem is an estimation
problem and is in a sense the reverse of the prediction problem. In
prediction one wishes to predict $x$ knowing $z_0$ on the basis of estimates
of $\alpha$, $\beta$, $\sigma$. In discrimination one wishes to estimate $z_0$ having observed
$x$. The general class of biological assay problems are of this character.
Thus, for example, the concentration of a certain vitamin may be
measured by observing the gain in weight of a week-old chick when its
diet is augmented by daily doses of the vitamin for several days. A
manufacturer of the vitamin might determine the strength of a new
batch as follows: Let $x$ be the gain in weight and let $z$ be the concentration.
Using material of known concentration, he would feed
several chicks with different concentrations $z_i$ $(i = 1, 2, \cdots, n)$ and
observe their gains in weight $x_i$. At the same time other chicks would
receive their vitamins from the batch with unknown concentration $z_0$,
and their gains in weight, say $x_j'$ $(j = 1, 2, \cdots, m)$, would be
observed. On the basis of these data it is desired to estimate the
parameter $z_0$.

The general problem of classification is a discrimination problem. 
Anthropologists, for example, make measurements x on skulls of known 
age z, then estimate the age zo of a skull of unknown age with measure- 
ments x’. Taxonomists use the technique to discriminate between 
varieties of plants with quite similar appearance. 

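Using the notation of the first paragraph and the model of Sec. 2, the
likelihood of the observations $x_1, x_2, \cdots, x_n$ and $x_1', x_2', \cdots, x_m'$ is

$$L = \left(\frac{1}{2\pi\sigma^2}\right)^{(m+n)/2} e^{-(1/2\sigma^2)\sum_i (x_i - \alpha - \beta z_i)^2 - (1/2\sigma^2)\sum_j (x_j' - \alpha - \beta z_0)^2} \tag{1}$$

and on differentiating the logarithm of this expression with respect to
$\sigma^2$, $\alpha$, $\beta$, $z_0$ in turn, one can readily determine the maximum-likelihood
estimates of these parameters; they are

$$\hat\beta = \frac{\sum (x_i - \bar{x})(z_i - \bar{z})}{\sum (z_i - \bar{z})^2} \tag{2}$$
$$\hat\alpha = \bar{x} - \hat\beta\bar{z} \tag{3}$$
$$\hat\sigma^2 = \frac{1}{m+n}\left[\sum_i (x_i - \hat\alpha - \hat\beta z_i)^2 + \sum_j (x_j' - \bar{x}')^2\right] \tag{4}$$
$$\hat{z}_0 = \frac{\bar{x}' - \hat\alpha}{\hat\beta} \tag{5}$$

where

$$\bar{x} = \frac{1}{n}\sum x_i \qquad \bar{z} = \frac{1}{n}\sum z_i \qquad \bar{x}' = \frac{1}{m}\sum x_j'$$

Equations (2) and (3) are the same as (2.8) and (2.9); equation (5)
gives the desired point estimate of $z_0$.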
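
A minimal sketch of the point estimate (5) follows. The function name `estimate_z0` and the numbers are hypothetical, chosen only to illustrate the computation; `x_new` stands for the observations $x_j'$ made at the unknown $z_0$.

```python
def estimate_z0(x, z, x_new):
    """Point estimate of the unknown z0, equation (5).

    x, z  : calibration observations (x_i at known z_i)
    x_new : observations x'_j taken at the unknown z0
    """
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    if s_zz == 0:
        raise ValueError("at least two different values of z are required")
    beta_hat = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / s_zz
    alpha_hat = x_bar - beta_hat * z_bar
    if beta_hat == 0:
        # With a flat regression line, x carries no information about z.
        raise ValueError("beta_hat is zero; z0 cannot be estimated")
    x_new_bar = sum(x_new) / len(x_new)
    return (x_new_bar - alpha_hat) / beta_hat


# Hypothetical assay: calibration points on x = 1 + 2z, new mean gain 4.
z0_hat = estimate_z0([1.0, 3.0, 5.0, 7.0], [0.0, 1.0, 2.0, 3.0], [3.5, 4.5])
```

The estimate is unreliable when $\hat\beta$ is near zero, which is why the sketch refuses to divide by an exactly zero slope.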
A confidence interval for $z_0$ is also easily set up. The variate

$$v = \bar{x}' - \hat\alpha - \hat\beta z_0 \tag{6}$$

is normally distributed since it is a linear function of normal variates;
its mean is zero, and its variance is

$$\sigma_v^2 = \sigma^2\left[\frac{1}{m} + \frac{1}{n} + \frac{(z_0 - \bar{z})^2}{\sum (z_i - \bar{z})^2}\right] \tag{7}$$

just as was found in Sec. 3. The two sums in (4) both have chi-square
distributions when they are divided by $\sigma^2$, the first with $n - 2$ and the
second with $m - 1$ degrees of freedom. The two chi squares are
independent since they are functions of independent samples; hence
their sum has the chi-square distribution with $m + n - 3$ degrees of
freedom. Furthermore the two chi squares are obviously independent
of $v$. It follows then that

$$t = \frac{v/\sigma_v}{\sqrt{(m+n)\hat\sigma^2/(m+n-3)\sigma^2}} \tag{8}$$

has the $t$ distribution and will provide a confidence interval for $z_0$, since
that is the only unknown parameter which appears in (8).

We have considered a very much simplified discrimination problem,
but it is one which occurs frequently in practice. The more general
problem has to do with the case in which each observation consists of
several components $(x_1, x_2, \cdots, x_k)$ which have a multivariate
normal distribution with means $\alpha_1 + \beta_1 z$, $\alpha_2 + \beta_2 z$, $\cdots$, $\alpha_k + \beta_k z$.
Given estimates of the $\alpha$'s and $\beta$'s on the basis of a sample of observations
$(x_{1i}, x_{2i}, \cdots, x_{ki})$, one wishes to estimate $z$ for an observation
$(x_{10}, x_{20}, \cdots, x_{k0})$. We shall have to omit this problem because it is
very cumbersome to handle by elementary methods.

13.5. Multiple Regression. We shall consider now a variate $x$
which is normally distributed with variance $\sigma^2$ and with a mean of the
form $\alpha_1 z_1 + \alpha_2 z_2 + \cdots + \alpha_k z_k$; the $z$'s are observable parameters,
and we are concerned with the other parameters (the $\alpha$'s and $\sigma^2$). We
may wish to estimate the parameters or test certain hypotheses about
the parameters. The density for a sample of size $n$ is

$$\left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-(1/2\sigma^2)\sum_i (x_i - \sum_p \alpha_p z_{pi})^2} \tag{1}$$

and the logarithm of the likelihood is

$$L = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_i \left(x_i - \sum_p \alpha_p z_{pi}\right)^2 \tag{2}$$

We shall let the indices $i$ and $j$ run from 1 to $n$, and the indices $p$, $q$, $r$,
and $s$ run from 1 to $k$. On differentiating $L$ with respect to $\alpha_q$, we find
that the $\hat\alpha$'s are determined by the following set of $k$ normal equations
(there being an equation for each value of $q$):

$$\sum_i z_{qi}\left(x_i - \sum_p \hat\alpha_p z_{pi}\right) = 0 \tag{3}$$

If we define $a_{pq}$ and $y_q$ by the relations

$$a_{pq} = \sum_i z_{pi} z_{qi}$$
$$y_q = \sum_i z_{qi} x_i$$

the normal equations may be written

$$\sum_p a_{pq}\hat\alpha_p = y_q \tag{4}$$

The matrix of coefficients $\|a_{pq}\|$ may be inverted if its determinant does
not vanish, and letting $a^{pq}$ represent the elements of the inverse
matrix, the solution of (4) for the $\hat\alpha$'s may be written

$$\hat\alpha_p = \sum_q a^{pq} y_q \tag{5}$$

as follows by multiplying both sides of (4) by $a^{qr}$ and summing on $q$
(see Sec. 9.2). The maximum-likelihood estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n}\sum_i \left(x_i - \sum_p \hat\alpha_p z_{pi}\right)^2 \tag{6}$$

as follows from putting the derivative of $L$ with respect to $\sigma^2$ equal to
zero and substituting the $\hat\alpha$'s for the $\alpha$'s.
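
The normal equations (4) and the estimator (6) can be sketched in a few lines of Python. The code below is an illustration, not a prescribed method: it solves (4) by Gaussian elimination with partial pivoting rather than by forming the inverse $\|a^{pq}\|$ explicitly, and the function names and data are assumptions made for the example.

```python
def solve_linear_system(a, y):
    """Solve a * sol = y by Gaussian elimination with partial pivoting."""
    k = len(y)
    m = [row[:] + [yi] for row, yi in zip(a, y)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("the determinant of ||a_pq|| vanishes")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * k
    for r in range(k - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * sol[c] for c in range(r + 1, k))
        sol[r] = (m[r][k] - s) / m[r][r]
    return sol


def fit_multiple_regression(x, z_rows):
    """Return (alpha_hat, sigma2_hat); z_rows[i] = [z_1i, ..., z_ki]."""
    n, k = len(x), len(z_rows[0])
    # a_pq = sum_i z_pi z_qi and y_q = sum_i z_qi x_i, as defined before (4).
    a = [[sum(row[p] * row[q] for row in z_rows) for q in range(k)]
         for p in range(k)]
    y = [sum(row[q] * xi for row, xi in zip(z_rows, x)) for q in range(k)]
    alpha_hat = solve_linear_system(a, y)          # normal equations (4)
    sigma2_hat = sum(
        (xi - sum(ap * zp for ap, zp in zip(alpha_hat, row))) ** 2
        for xi, row in zip(x, z_rows)
    ) / n                                          # equation (6)
    return alpha_hat, sigma2_hat


# Hypothetical data with k = 2 and z_1 = 1, i.e. the simple regression of Sec. 2.
alpha_hat, sigma2_hat = fit_multiple_regression(
    [1.0, 3.0, 5.0, 7.0], [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
)
```

With $z_1 = 1$ for every observation the two coefficients reduce to $\hat\alpha$ and $\hat\beta$ of Sec. 2.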

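Distributions and Confidence Regions. In considering the distribution
of the estimators, we observe that the $a_{pq}$ are not functions of the
random variables $x_i$ and that the $y_q$ are linear functions of normally
distributed variates and must therefore be normally distributed. We
may determine the distribution of the $\hat\alpha_p$ by simply finding their means,
variances, and covariances. The mean is

$$E(\hat\alpha_p) = E\left(\sum_q a^{pq} y_q\right)$$
$$= \sum_q a^{pq} \sum_i z_{qi} E(x_i)$$
$$= \sum_q a^{pq} \sum_i z_{qi} \sum_r \alpha_r z_{ri}$$
$$= \sum_q \sum_r a^{pq} a_{qr} \alpha_r$$
$$= \alpha_p \tag{7}$$

The covariance of $\hat\alpha_p$ and $\hat\alpha_q$ is

$$E(\hat\alpha_p - \alpha_p)(\hat\alpha_q - \alpha_q) = E(\hat\alpha_p\hat\alpha_q) - \alpha_p\alpha_q$$
$$= E\left(\sum_r a^{pr} \sum_i z_{ri} x_i\right)\left(\sum_s a^{qs} \sum_j z_{sj} x_j\right) - \alpha_p\alpha_q$$
$$= \sum_{i,j}\left(\sum_r a^{pr} z_{ri}\right)\left(\sum_s a^{qs} z_{sj}\right) E(x_i x_j) - \alpha_p\alpha_q \tag{8}$$

When $i \neq j$,

$$E(x_i x_j) = \left(\sum_u \alpha_u z_{ui}\right)\left(\sum_v \alpha_v z_{vj}\right)$$

where $u$ and $v$ run from 1 to $k$, and when $i = j$,

$$E(x_i^2) = \left(\sum_u \alpha_u z_{ui}\right)^2 + \sigma^2$$

On substituting these values in (8) and making reductions similar to
those employed to obtain (7), one finds

$$E[(\hat\alpha_p - \alpha_p)(\hat\alpha_q - \alpha_q)] = a^{pq}\sigma^2 \tag{9}$$

The inverse of the matrix $\|a^{pq}\sigma^2\|$ is $\|a_{pq}/\sigma^2\|$; hence the $\hat\alpha$'s have the
density

$$\left(\frac{1}{2\pi\sigma^2}\right)^{k/2} \sqrt{|a_{pq}|}\; e^{-(1/2\sigma^2)\sum a_{pq}(\hat\alpha_p - \alpha_p)(\hat\alpha_q - \alpha_q)} \tag{10}$$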

It can also be shown that $n\hat\sigma^2/\sigma^2$ has the chi-square distribution with
$n - k$ degrees of freedom and further that $n\hat\sigma^2/\sigma^2$ is distributed
independently of the $\hat\alpha$'s. We shall omit the argument, which is somewhat
complicated but entirely analogous to that used in Sec. 2 to obtain
the joint distribution of $\hat\alpha$, $\hat\beta$, and $\hat\sigma^2$ in the case $k = 2$. From these
facts it follows that any particular regression coefficient $\alpha_p$ may be
estimated by a confidence interval using the $t$ distribution; $\hat\alpha_p - \alpha_p$ is
normally distributed with zero mean and variance $a^{pp}\sigma^2$; hence

$$t = \frac{(\hat\alpha_p - \alpha_p)\sqrt{n-k}}{\sqrt{n\hat\sigma^2 a^{pp}}} \tag{11}$$

has the $t$ distribution with $n - k$ degrees of freedom and involves no
unknown parameters except $\alpha_p$. A confidence region for the whole
set of regression coefficients, $\alpha_1, \alpha_2, \cdots, \alpha_k$, in a $k$-dimensional
space may be determined by the inequality in

$$P(F < F_{1-\delta}) = \delta$$

where $F_{1-\delta}$ is the critical level for the $F$ distribution with $k$ and $n - k$
degrees of freedom. The quadratic form in the exponent of (10) has
the chi-square distribution with $k$ degrees of freedom and is distributed
independently of $n\hat\sigma^2/\sigma^2$; hence

$$F = \frac{(n-k)\sum a_{pq}(\hat\alpha_p - \alpha_p)(\hat\alpha_q - \alpha_q)}{kn\hat\sigma^2} \tag{12}$$

has the $F$ distribution with $k$ and $n - k$ degrees of freedom.

It may be instructive to compare the results obtained thus far in
this section with those of Sec. 2 by putting $k = 2$, $z_1 = 1$, and identifying
$\alpha_1$, $\alpha_2$, $z_2$ with $\alpha$, $\beta$, $z$, respectively.

Prediction. Given estimates of the parameters $\alpha_p$ and $\sigma$ in (1), one
may predict the value of $x$ corresponding to a given set of values,
$z_{0p}$, of the observable parameters. The predicted value would of
course be

$$\hat{x}_0 = \sum_p \hat\alpha_p z_{0p} \tag{13}$$

The prediction interval is set up by considering the variate

$$u = x - \sum \hat\alpha_p z_{0p}$$

which is normally distributed as it is a linear function of normally
distributed variates. The mean of $u$ is zero since both $x$ and $\sum \hat\alpha_p z_{0p}$
have expected value $\sum \alpha_p z_{0p}$. The variance of $u$ is

$$\sigma_u^2 = E(u^2) \tag{14}$$
$$= E\left(x - \sum \alpha_p z_{0p}\right)^2 + E\left[\sum (\hat\alpha_p - \alpha_p) z_{0p}\right]^2 \tag{15}$$

since $x$ is independent of the $\hat\alpha$'s. The first term on the right of (15) is
$\sigma^2$, and the second term is readily evaluated by means of (9). One
finds

$$\sigma_u^2 = \sigma^2\left(1 + \sum_{p,q} a^{pq} z_{0p} z_{0q}\right) \tag{16}$$

Thus the variate

$$t = \frac{u/\sigma_u}{\sqrt{n\hat\sigma^2/(n-k)\sigma^2}} \tag{17}$$

has the $t$ distribution with $n - k$ degrees of freedom ($u$ being independent
of $\hat\sigma^2$) and may be employed to define a prediction interval for
$x$ since it involves no unknown parameters.

13.6. Linear Hypotheses. Referring to the multiple regression let
us consider how we might test the hypothesis that the regression
coefficients $\alpha_p$ have certain specified values $\alpha_{0p}$. The null hypothesis is

$$H_0: \quad \alpha_p = \alpha_{0p} \quad (p = 1, 2, \cdots, k) \quad \text{and} \quad \sigma^2 > 0 \tag{1}$$

and the alternatives are

$$H_a: \quad -\infty < \alpha_p < \infty \quad (p = 1, 2, \cdots, k) \quad \text{and} \quad \sigma^2 > 0 \tag{2}$$

The subspace $\omega$ has one dimension, while $\Omega$ has $k + 1$ dimensions.
If the likelihood (5.1) is maximized in $\omega$ and in $\Omega$, one finds the $\lambda$
criterion, after considerable algebraic reduction, to be

$$\lambda = \frac{1}{[1 + kF/(n-k)]^{n/2}} \tag{3}$$

where $F$ is the quantity in (5.12) with the $\alpha_p$ replaced by $\alpha_{0p}$. Hence
the $\lambda$ test is equivalent to an $F$ test, and large values of $F$ correspond
to small values of $\lambda$; the null hypothesis would be tested by using the
right-hand tail of the $F$ distribution for the critical region. When the
$\alpha_{0p}$ are zero, as is often the case, the double sum in the numerator of
$F$ may be reduced to the simple form, $\sum \hat\alpha_p y_p$, by substituting for $\hat\alpha_q$
from (5.5).

A more commonly desired test is one which tests some but not all
the regression coefficients. Let us suppose that we wish to test
whether the coefficients $\alpha_1, \alpha_2, \cdots, \alpha_m$ $(m < k)$ have specified
values $\alpha_{0u}$ $(u = 1, 2, \cdots, m)$ whatever the values of the last $k - m$
of the $\alpha$'s. The null hypothesis is now

$$H_0: \quad \alpha_u = \alpha_{0u} \quad (u = 1, \cdots, m) \qquad -\infty < \alpha_r < \infty \quad (r = m+1, \cdots, k) \qquad \sigma^2 > 0 \tag{4}$$

while $H_a$ is as specified in (2). We shall merely present the test
without the derivation because of the complex algebraic reduction
required. It becomes plausible if one considers the marginal distribution
of the $\hat\alpha_u$ $(u = 1, \cdots, m)$; this distribution is obtained by
integrating out $\hat\alpha_{m+1}, \hat\alpha_{m+2}, \cdots, \hat\alpha_k$ from (5.10). After the integration
there will remain a multivariate normal distribution, and the
coefficients of the quadratic form will be, say, $b_{uv}/\sigma^2$. The $b_{uv}$ are
obtained (Sec. 9.2) by striking out the last $k - m$ rows and columns
of $\|a^{pq}\|$ and inverting the result; i.e.,

$$\|b_{uv}\| = \|a^{uv}\|^{-1} \qquad u, v = 1, 2, \cdots, m$$

The quadratic form of the marginal distribution is

$$Q = \frac{\sum_{u,v} b_{uv}(\hat\alpha_u - \alpha_u)(\hat\alpha_v - \alpha_v)}{\sigma^2} \tag{5}$$

and it has the chi-square distribution with $m$ degrees of freedom.
Since $Q$ is distributed independently of $\hat\sigma^2$, the quantity

$$F' = \frac{Q/m}{n\hat\sigma^2/(n-k)\sigma^2} \tag{6}$$

has the $F$ distribution with $m$ and $n - k$ degrees of freedom. The $\lambda$
criterion for testing (4) turns out to be

$$\lambda = \frac{1}{[1 + mF'/(n-k)]^{n/2}} \tag{7}$$

with the $\alpha_{0u}$'s substituted for the $\alpha_u$'s in $Q$; hence $F'$ provides an equivalent
test.

We are now in a position to consider what is called the general
linear hypothesis of normal regression theory. The problem is to test
the hypothesis that the coefficients $\alpha_p$ satisfy certain linear relations,
say:

$$c_{11}\alpha_1 + c_{12}\alpha_2 + \cdots + c_{1k}\alpha_k = c_{01}$$
$$c_{21}\alpha_1 + c_{22}\alpha_2 + \cdots + c_{2k}\alpha_k = c_{02}$$
$$\cdots$$
$$c_{m1}\alpha_1 + c_{m2}\alpha_2 + \cdots + c_{mk}\alpha_k = c_{0m}$$

where $m < k$ and the $c$'s are given numbers. These equations may be
written

$$\sum_p c_{up}\alpha_p = c_{0u} \qquad p = 1, 2, \cdots, k \qquad u = 1, 2, \cdots, m \tag{8}$$

We suppose that these $m$ relations are independent, i.e., that it is not
possible to obtain one of them by adding chosen multiples of the others.

The null hypothesis that (8) is true may be reduced to the form of
(4) by recasting the problem in terms of new parameters, say $\beta_1, \beta_2,
\cdots, \beta_k$, and new observable parameters, say $w_1, w_2, \cdots, w_k$. The
first $m$ of the $\beta$'s are defined by putting

$$\sum_p c_{up}\alpha_p = \beta_u \tag{9}$$

The independence of the relations (8) ensures that $m$ of the $\alpha$'s can be
solved for in terms of the remaining $\alpha$'s and the $\beta_u$. Supposing the
equations can be solved for the first $m$ of the $\alpha$'s, the solutions are
simply

$$\alpha_u = \sum_v c^{uv}\left(\beta_v - \sum_r c_{vr}\alpha_r\right) \tag{10}$$

where $u$ and $v$ run from 1 to $m$ and $r$ runs from $m + 1$ to $k$, and where
the $c^{uv}$ are the elements of the inverse of $\|c_{uv}\|$. The remaining $\beta$'s
may be put equal to the remaining $\alpha$'s:

$$\alpha_r = \beta_r \qquad r = m+1, \cdots, k \tag{11}$$

These new parameters $\beta_p$ are now substituted for the $\alpha_p$ in the mean of
$x$:

$$\sum_p \alpha_p z_p = \sum_u \left[\sum_v c^{uv}\left(\beta_v - \sum_r c_{vr}\beta_r\right)\right] z_u + \sum_r \beta_r z_r \tag{12}$$

The new observable parameters are then taken to be the coefficients
of the $\beta$'s in (12); i.e., $w_p$ is the coefficient of $\beta_p$ in (12):

$$w_p = \sum_u c^{up} z_u \qquad p = 1, 2, \cdots, m$$
$$w_p = z_p - \sum_{u,v} c^{uv} c_{vp} z_u \qquad p = m+1, \cdots, k \tag{13}$$

The mean of $x$ is now expressed in the form $\sum \beta_p w_p$. The null hypothesis
becomes simply $\beta_u = c_{0u}$ $(u = 1, 2, \cdots, m)$, the one already
discussed as (4).

13.7. Applications of Normal Regression Theory. The estimation
and test procedures we have just developed have a very wide range of
application. The reason for this is the completely arbitrary nature
of what we have called the observable parameters. The $z_p$ may, for
example, be artificial code variables. Thus, suppose in a fertilizer
experiment to investigate the effect of nitrogen and potash on a given
crop, the crop is grown on plots with different fertilizer treatments.
We may express the mean yield in the form $\sum_{p=1}^{4} \alpha_p z_p$. Let $z_1 = 1$ for
all plots; let $z_2$ be zero for those plots with no nitrogen and one for all
plots with a given application of nitrogen; let $z_3$ be zero for plots with
no potash and one for those with potash; and let $z_4$ be zero for all plots
except those treated with both fertilizers. Now $\alpha_1$ represents the
yield with no fertilizer, $\alpha_2$ the added yield due to nitrogen, $\alpha_3$ the
added yield due to potash, $\alpha_2 + \alpha_3 + \alpha_4$ the added yield due to both
fertilizers. Having performed the experiment, we may estimate the
$\alpha$'s, and we may test various hypotheses. Thus to test whether potash
has any effect, we set up the null hypothesis that it does not and test
whether $\alpha_3$ and $\alpha_4$ are both zero. To test whether effects of nitrogen
and potash are strictly additive (that there is no interaction between
nitrogen and potash), we would test whether $\alpha_4 = 0$.
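
The coding of the fertilizer experiment can be sketched as follows. The function names and the coefficient values are hypothetical and serve only to show how the code variables $z_1, \cdots, z_4$ select the terms of the mean yield.

```python
def code_plot(nitrogen, potash):
    """Code variables [z1, z2, z3, z4] for one plot.

    z1 = 1 always; z2 = 1 with nitrogen; z3 = 1 with potash;
    z4 = 1 only when both fertilizers are applied.
    """
    return [1, int(nitrogen), int(potash), int(nitrogen and potash)]


def mean_yield(alpha, nitrogen, potash):
    """Mean yield sum_p alpha_p z_p for the given treatment."""
    return sum(a * z for a, z in zip(alpha, code_plot(nitrogen, potash)))


# Hypothetical coefficients alpha_1, ..., alpha_4.
alpha = [10.0, 3.0, 2.0, 1.0]
yields = {
    "none": mean_yield(alpha, False, False),      # alpha_1
    "nitrogen": mean_yield(alpha, True, False),   # alpha_1 + alpha_2
    "potash": mean_yield(alpha, False, True),     # alpha_1 + alpha_3
    "both": mean_yield(alpha, True, True),        # alpha_1 + ... + alpha_4
}
```

With $\alpha_4 = 0$ the yield with both fertilizers would be exactly the sum of the two separate added yields, which is the additivity hypothesis of the text.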

In another instance the $z_p$ may represent functions of some variable.
As an example, we may consider a time series. The average monthly
prices of some agricultural product, eggs, for example, if plotted
against time over a period of years, will show rather erratic looking
fluctuations but will have certain inherent regularities. There will
be a trend of some kind, a smooth curve which may be thought of as
representing the general character of the variation of price with time
apart from any fluctuations. Also there will be an annual cycle of
sorts; the prices in a given year will usually be higher during the winter
months than the summer months. A firm which stores eggs in large
quantity may wish to know, for example, whether the amplitude
of the cycle is independent of the average price level from year to year.
This question might be studied as follows: Let $x$ be the price, and let
$t$ represent time in months. The data consist of prices $x_1, x_2, \cdots, x_n$
at times $t = 1, 2, \cdots, n$. Over the period of time included, let us
suppose it is apparent that a quadratic function will fit the trend quite
well enough. Then the following regression function might reasonably
represent the trend and cycle if the null hypothesis (that the amplitude
is constant) is true:

$$\alpha_1 + \alpha_2 t + \alpha_3 t^2 + \alpha_4 \sin\frac{2\pi t}{12} + \alpha_5 \cos\frac{2\pi t}{12}$$

If the null hypothesis is not true, the amplitude might reasonably be
supposed to be proportional to the general price level given by the
trend, or more generally, to be some linear or quadratic function of the
time. To take account of this possibility, terms like

$$\alpha_6 t \sin\frac{2\pi t}{12} + \alpha_7 t \cos\frac{2\pi t}{12} + \alpha_8 t^2 \sin\frac{2\pi t}{12} + \alpha_9 t^2 \cos\frac{2\pi t}{12}$$

would be added to the function given above. The $z_p$ are now defined
by $z_1 = 1$, $z_2 = t$, $\cdots$, $z_9 = t^2 \cos(2\pi t/12)$. The null hypothesis
would be tested by testing whether the last four regression coefficients
were zero.
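
The nine observable parameters of the time-series example may be generated as in this sketch. The function name `seasonal_regressors` is an assumption made for illustration; the ordering of the terms follows the two displayed regression functions.

```python
import math


def seasonal_regressors(t):
    """Return [z1, ..., z9] for month t.

    z1..z5 are the trend and constant-amplitude cycle terms; z6..z9 let
    the amplitude of the annual cycle vary linearly or quadratically
    with time.
    """
    s = math.sin(2 * math.pi * t / 12)
    c = math.cos(2 * math.pi * t / 12)
    return [1.0, t, t ** 2, s, c, t * s, t * c, t ** 2 * s, t ** 2 * c]


# Rows of observable parameters for three years of monthly data.
rows = [seasonal_regressors(t) for t in range(1, 37)]
```

Each row can serve as one set of $z_{pi}$ in the normal equations of Sec. 5, and the null hypothesis concerns the coefficients of the last four entries.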

The observable parameters may be any functions of any number of
variables. Thus, for example, a variate $x$ may be known to be some
function of two variables $u$ and $v$, but the form of the function, say
$f(u, v)$, may be unknown, and the purpose of the experiment may be to
investigate the form of the function in the neighborhood of some point
$(u_0, v_0)$. It may be reasonable to suppose that the function can be
adequately represented in this neighborhood by a quadratic function,
i.e., by the first six terms of its series expansion:

$$f(u_0, v_0) + f_u(u_0, v_0)(u - u_0) + f_v(u_0, v_0)(v - v_0)$$
$$+ \tfrac{1}{2}\left[f_{uu}(u_0, v_0)(u - u_0)^2 + 2f_{uv}(u_0, v_0)(u - u_0)(v - v_0) + f_{vv}(u_0, v_0)(v - v_0)^2\right]$$

where the subscripts indicate partial differentiation. One would
merely estimate the $\alpha$'s in $\sum_{p=1}^{6} \alpha_p z_p$ where $z_1 = 1$, $z_2 = u - u_0$,
$z_3 = v - v_0$, $z_4 = (u - u_0)^2$, $z_5 = (u - u_0)(v - v_0)$, $z_6 = (v - v_0)^2$.
If one wished to test the adequacy of the quadratic representation,
cubic terms might be included in the regression function.

13.8. The Method of Least Squares. There is a general problem of
curve fitting which is entirely unrelated to normal regression theory
but which may be solved by formulas identical with those we have
obtained for estimating regression coefficients.

Suppose some variable $x$ is a function $f(z)$ of another variable $z$
and that the function has been investigated by measuring $x$ for certain
chosen values of $z$. The result might be as shown in Fig. 66. There
may be no question of random variation. The value $x_1$ measured at
$z_1$ might be exactly the same if it were determined a second time. The
function is simply not smooth. But for purposes for which the function
is to be used, one may wish to approximate it by a smooth function,
say a straight line. How might such an approximating line be
drawn? One might simply lay a transparent ruler along the points
and draw a line which fits pretty well, and this method may be as good
as any for the purposes at hand. Or one might divide the points into
two groups, the left-hand four and right-hand four, and compute the
averages of the $x$ and $z$ values for the two groups. The averages $\bar{x}$
and $\bar{z}$ for one group will determine one point, and the averages $\bar{x}$ and $\bar{z}$
of the other group will determine a second point which, together
with the first, determines an approximating line. There are many
possibilities.
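
The two-group method just described can be sketched as follows. This is an illustration under assumptions not in the text: the function name is hypothetical, the points are sorted by $z$ before splitting, and for an odd number of points the middle one is simply left out.

```python
def two_group_line(x, z):
    """Line through the averages of the left-hand and right-hand groups.

    Returns (alpha, beta) for the line x = alpha + beta * z. The points
    are sorted by z and split into two equal groups (the middle point is
    dropped if the count is odd, an assumption of this sketch).
    """
    pairs = sorted(zip(z, x))
    half = len(pairs) // 2
    if half == 0:
        raise ValueError("need at least two points")
    left, right = pairs[:half], pairs[len(pairs) - half:]
    z1 = sum(p[0] for p in left) / half
    x1 = sum(p[1] for p in left) / half
    z2 = sum(p[0] for p in right) / half
    x2 = sum(p[1] for p in right) / half
    if z2 == z1:
        raise ValueError("the two groups have the same average z")
    beta = (x2 - x1) / (z2 - z1)
    alpha = x1 - beta * z1
    return alpha, beta


# Eight hypothetical points lying exactly on x = 1 + 2z.
z = [float(i) for i in range(8)]
x = [1.0 + 2.0 * zi for zi in z]
alpha, beta = two_group_line(x, z)
```

For points that do not lie exactly on a line this method and the method of least squares will in general give different lines.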


FIG. 66. (Plot of $x$ against $z$ at the points $z_1, z_2, \cdots, z_8$.)


The problem is generally solved by what is called the method of least 
squares. This method chooses that line, x = œ + 8z, which minimizes 
the sum of squares of the vertical deviations of the points from the line. 
Supposing now that there are n points (z;, z;) (¢ = 1, 2, - - + , n) and 
that we denote the ordinate of the point on the line at z: by zi, the 
vertical deviations are x; — z; and their sum of squares is, say, 


S = X (z; — 21)? = X (z: — a — 6a)? 


We wish to fix the line (determine a and 8) so that S will be minimized. 
This would be done by setting the partial derivatives of S with respect 
to œ and £ equal to zero and solving for œ and 8, The resulting equa- 
tions are the same as (2.5) and (2.6). 
More generally, any empirical function $x_i = f(u_i, v_i, \dots, w_i)$
$(i = 1, 2, \dots, n)$ may be approximated by any linear combination
$\sum_p \alpha_p z_p$ of known functions $z_p$ of the variates $u, v, \dots, w$ by the
method of least squares. One would choose the $\alpha$'s so as to minimize
the sum of squares of the deviations of the $x_i$ from $x_i' = \sum_p \alpha_p z_{pi}$; i.e.,
one would minimize

$$S = \sum_i \Bigl(x_i - \sum_p \alpha_p z_{pi}\Bigr)^2$$

with respect to the $\alpha$'s and find that they were determined by the relations
(5.3).
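
The general case can be sketched in the same way. The code below is an illustration under stated assumptions rather than the book's own procedure. It forms the linear system obtained by setting each partial derivative of S to zero (usually called the normal equations) and solves it by Gaussian elimination. The helper names `solve` and `least_squares`, the choice of the known functions 1, u, u², and the data are all hypothetical.

```python
def solve(A, b):
    """Solve the square system A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    y = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * y[j] for j in range(r + 1, n))
        y[r] = (M[r][n] - s) / M[r][r]
    return y


def least_squares(x, Z):
    """Z[p][i] is the value of the known function z_p at observation i."""
    k = len(Z)
    A = [[sum(zp * zq for zp, zq in zip(Z[p], Z[q])) for q in range(k)]
         for p in range(k)]
    b = [sum(zp * xi for zp, xi in zip(Z[p], x)) for p in range(k)]
    return solve(A, b)


# Hypothetical data lying exactly on x = 1 + u^2.
u = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 5.0, 10.0, 17.0]
Z = [[1.0] * len(u), u, [ui ** 2 for ui in u]]   # known functions 1, u, u^2
alpha = least_squares(x, Z)
```

Because the data were generated from a function of the assumed form, the fitted coefficients reproduce it to within rounding error. With real data the same system gives the coefficients that minimize S.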

The primary reason that the method of least squares is commonly 
used for curve fitting is merely that it leads to a simple linear system 
of equations for determining the coefficients. To determine the coeffi- 
cients by minimizing, say, the sum of the absolute deviations, or the 
sum of the fourth powers of the deviations, would ordinarily be much 
more troublesome. It just happens that the form of the normal dis- 
tribution is such that the sum of squares of deviations from the regres- 
sion function is to be minimized to determine the coefficients in the 
regression function. If, for example, the points in Fig. 66 were supposed
to be deviations from a regression line with a probability distribution
other than a normal distribution, then it would be appropriate
to determine estimates of $\alpha$ and $\beta$ by maximizing the likelihood
defined by that distribution. Even here, though, the method of least 
squares is commonly used in practice to avoid algebraic and arithmetic 
difficulties, and this is, of course, good and sufficient reason. The 
theoretical advantages of the principle of maximum likelihood over the 
principle of least squares may become unimportant when it comes to a 
matter of choosing, say, between a 40-hour and a 10-hour computation. 
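
To give a feeling for the difference in labor, here is a small sketch comparing the least-squares line with the line minimizing the sum of absolute deviations. It relies on a known property of the latter problem: when the z values are not all equal, some minimizing line passes through at least two of the data points, so a brute-force search over pairs of points suffices for small samples. The function names and the data (which include one deliberately wild point) are hypothetical.

```python
from itertools import combinations


def sum_abs_dev(a, b, x, z):
    return sum(abs(xi - a - b * zi) for xi, zi in zip(x, z))


def lad_line(x, z):
    """Least-absolute-deviations line found by trying every line through two data points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if z[i] == z[j]:
            continue                      # vertical line, not of the form a + b*z
        b = (x[j] - x[i]) / (z[j] - z[i])
        a = x[i] - b * z[i]
        cost = sum_abs_dev(a, b, x, z)
        if best is None or cost < best[2]:
            best = (a, b, cost)
    return best


def ls_line(x, z):
    """Least-squares line, available in closed form."""
    n = len(x)
    x_bar, z_bar = sum(x) / n, sum(z) / n
    b = (sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
         / sum((zi - z_bar) ** 2 for zi in z))
    return x_bar - b * z_bar, b


# Hypothetical data: four points on x = z and one wild observation.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0, 15.0]
a_lad, b_lad, cost_lad = lad_line(x, z)
a_ls, b_ls = ls_line(x, z)
```

The least-squares coefficients come from two explicit formulas, whereas the absolute-deviation fit here needs a search whose size grows with the square of the number of points. The two criteria also give quite different lines for these data.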

13.9. Notes and References. A more complete account of the 
theory of regression may be found in Chap. VIII of Wilks’ book [1]. In 
particular, the proof of the important result that $\hat{\sigma}^2$ is distributed independently
of the $\hat{\alpha}$'s is given there. The notation of Secs. 5 and 6
has been made quite similar to that of Wilks in order to facilitate 
reference to that proof and to others which are omitted here. 

There is a great body of literature on a subject which we have
omitted entirely. A special case of normal regression theory of particular
interest arises if one considers the conditional distribution of,
say, $x_1$ in a k-variate normal distribution; it is normal with a mean
which is a linear function of the other variates, $x_2, x_3, \dots, x_k$. The
coefficients of these variates (corresponding to what we have called $\alpha_p$)
are certain functions of the variances and covariances of the original 
multivariate normal distribution. Estimation of these coefficients 
implies estimation of certain correlations and partial correlations. 
There is an elaborate theory associated with this sort of correlation 
analysis which was once regarded as a very essential part of statistics. 


In recent years it has come to be realized that most (though not all)
correlation problems which arise in practice can be handled more 
appropriately by regression methods. The latter require only the 
assumption that deviations from the regression function be normal, 
whereas the correlation analysis requires that the variate and what 
we have called the observable parameters all be jointly normally dis- 
tributed. A good account of correlation analysis is given by Kendall 
[2]. 

A rather complete treatment of the theory of least squares and its 
various applications may be found in [3]. In [4] are treated a great 
variety of practical problems in regression and correlation analysis. 


1. S. S. Wilks: “Mathematical Statistics,” Princeton University Press,
Princeton, N. J., 1943.

2. M. G. Kendall: “Advanced Theory of Statistics,” Vol. 1, Charles
Griffin & Co., Ltd., London, 1944.

3. W. E. Deming: “Statistical Adjustment of Data,” John Wiley &
Sons, Inc., New York, 1943.

4. M. Ezekiel: “Methods of Correlation Analysis,” John Wiley &
Sons, Inc., New York, 1930.


13.10. Problems 


1. Verify equations (2.22) and (2.23). 

2. Derive the likelihood-ratio criterion for testing the null hypoth- 
esis that the parameter a of Sec. 2 has the value ap. 

3. Verify equations (3.3) and (3.6). 

4. Verify equations (2), (3), (4), and (5) of Sec. 4. 

5. Verify equation (5.9).

6. Verify equation (6.3). 

7. Verify equation (6.7). 

8. Given the data:

x | -6.1 | -0.5 | 7.2 | 6.9 | -0.2 | -2.1 | -3.9 | 3.8 | -7.5 | -2.1
z | -2.0 |  0.6 | 1.4 | 1.3 |  0.0 | -1.6 | -1.7 | 0.7 | -1.8 | -1.1

fit a regression line assuming x is normally distributed about a linear
function of z, and find a 95 per cent confidence interval for the coefficient
of z.

9. Plot the regression line of Prob. 8 and plot two curves show- 
ing the 95 per cent limits of prediction intervals for x in the range 
TILA KS: 



10. Plot a 95 per cent confidence region for the two regression 
parameters of Prob. 8. 
11. Given the data: 


fit a regression plane, and find a 95 per cent confidence interval
for $\sigma^2$.

12. Find a 95 per cent confidence interval for $\alpha_1$ of Prob. 11.

13. Test the null hypothesis that $\alpha_2$ of Prob. 11 is zero.

14. What is the 95 per cent prediction interval for x at $z_1 = 2.5$,
$z_2 = 2.5$ in Prob. 11?

15. Test the null hypothesis that o + 10a» = 0 in Prob. 11. 

16. Using only the first two rows of the data of Prob. 11, fit a
regression function of the form

$$\alpha_0 + \alpha_1 z_1 + \alpha_2 z_1^2$$

and test the null hypothesis that $\alpha_2 = 0$.
17. The fitting of polynomials such as the quadratic of Prob. 16 is
much simplified when the values are equally spaced by using orthogonal
polynomials. Let $z = 0, 1, \dots, n$. The first three orthogonal
polynomials are

$$P_1 = z - \frac{n}{2}$$

$$P_2 = \left(z - \frac{n}{2}\right)^2 - \frac{n(n + 2)}{12}$$

$$P_3 = \left(z - \frac{n}{2}\right)^3 - \frac{3n^2 + 6n - 4}{20}\left(z - \frac{n}{2}\right)$$

Show that

$$\sum_z P_1P_2 = \sum_z P_1P_3 = \sum_z P_2P_3 = 0$$


18. Rework Prob. 16, fitting instead the regression function

$$\alpha_0 + \alpha_1 P_1 + \alpha_2 P_2$$

where $P_1$ and $P_2$ are defined in Prob. 17.


19. If $x_1$ and $x_2$ have a bivariate normal distribution, what are the
coefficients (in terms of $\sigma_{11}$, $\sigma_{22}$, and $\rho$) of the regression function for the
conditional distribution of $x_1$? For the conditional distribution of $x_2$?
If the two regression lines were estimated from the same sample,
would they, in general, be different?

20. If $x_1, x_2, x_3$ have a trivariate normal distribution, what are the
coefficients of the regression function for the conditional distribution
of $x_1$, given $x_2$ and $x_3$, in terms of the variances and correlations?

21. If the correlation $\rho$ of a bivariate normal distribution is zero,
show that its estimator $\hat{\rho}$ has the density

$$\frac{[(n - 3)/2]!}{\sqrt{\pi}\,[(n - 4)/2]!}\,(1 - \hat{\rho}^2)^{(n-4)/2}$$

for samples of size n.
22. Referring to Prob. 21, transform $\hat{\rho}$ to a new variate

$$t = \frac{\hat{\rho}\,\sqrt{n - 2}}{\sqrt{1 - \hat{\rho}^2}}$$

showing that it has “Student's” distribution with n - 2 degrees of
freedom so that the t tables may be used for testing the null hypothesis
$\rho = 0$.

23. Assume that the data of Prob. 8 are from a bivariate normal
population and test the null hypothesis that $\rho = 0$.

24. When $\rho$ is not zero, the distribution of $\hat{\rho}$ is not a simple function,
but it has been tabulated for n, the sample size, less than 25. For
larger n, Fisher has shown that

$$\frac{1}{2}\log\frac{1 + \hat{\rho}}{1 - \hat{\rho}}$$

is approximately normally distributed with mean

$$\frac{1}{2}\log\frac{1 + \rho}{1 - \rho}$$

and variance 1/(n - 3). Using this result, estimate roughly a 95 per
cent confidence interval for $\rho$ of Prob. 23.

25. Derive the $\lambda$ criterion given in equation (6.3).

26. What is the maximum-likelihood estimator of the multiple
correlation coefficient $R_{1.23}$ (defined in Prob. 27 of Chap. 9)?

27. A variate x is distributed about a linear regression function,
$\alpha + \beta z$, by the density

$$f(x) = 1, \qquad \alpha + \beta z - \tfrac{1}{2} < x < \alpha + \beta z + \tfrac{1}{2}$$

Find the maximum-likelihood estimate of the regression function,
given the sample of four points (x, z): (0.3, 1), (-0.6, 2), (-1.7, 3),
(-1.8, 4). Compare it with the least-squares line.

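28. A variate x is distributed about $\alpha + \beta z$ by the density

$$f(x) = \begin{cases} \tfrac{1}{2}\,e^{-(x - \alpha - \beta z)}, & x > \alpha + \beta z \\ \tfrac{1}{2}\,e^{(x - \alpha - \beta z)}, & x < \alpha + \beta z \end{cases}$$

Estimate the regression function given the sample of four points
(x, z): (3.4, 1), (7.1, 2), (12.4, 3), (15.5, 4). Compare it with the
least-squares line.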

29. A normal variate x has mean $\alpha + \beta z$ and variance $\sigma^2$. The
parameter z can take only the values zero and one. Set up a test of
the hypothesis that $\beta = 0$ and compare it with the test of the equality
of means of two normal populations with the same variance. (If the
two means are $\mu_1$ and $\mu_2$, let $\alpha = \mu_1$ and $\beta = \mu_2 - \mu_1$.)

30. Referring to the situation described in the first paragraph of 
Sec. 7, set up a test for the null hypothesis a; = 0. Assume that 
there are 4n observations, there being n for each of four treatments: 
no fertilizer, nitrogen, potash, both nitrogen and potash. 




CHAPTER 14 
EXPERIMENTAL DESIGNS AND THE ANALYSIS OF VARIANCE 


14.1. Experimental Design. The general subject of experimental 
design is too broad to be included with any degree of completeness in 
this book. It comprises the processes of planning experiments, 
analyzing the results, and interpreting the results. We are primarily 
concerned with the last-mentioned problem, which, in so far as 
statistics is involved, is a matter of statistical inference. The tech- 
nique for making inferences is known as the analysis of variance, and 
it is that technique which will be studied in this chapter. In order to 
motivate the study, it will be instructive, however, to consider briefly 
some of the general aspects of experimental design. 

An experiment is intended to find out something about the relation 
between two or more variables. For example, one may wish to dis- 
cover the effect of carbon content (one variable) on the hardness 
(second variable) of steel; the effect of a drug in preventing colds; 
the value of paint in preserving wood; the effect on flavor of meat 
caused by cold storage; and so on. Any experiment may be thought of
as an investigation of a function of two or more variables. As we have
noted in the first chapter, some variables may be entirely unwanted
but must in the nature of things be involved in the experiment. In 
the terminology of experimental design, one variable may be called 
the subject of the experiment while the other variables are called 
factors. Thus, carbon content is a factor which affects the hardness 
of steel (the subject of the experiment); freezing (a factor) affects the 
flavor of meat (the subject). 

In planning experiments, one has on the one hand certain prin- 
ciples of experimental design, and on the other a large class of geometri- 
cal configurations, specific experimental designs. In accordance with 
the principles, one fits a specific design to the projected experiment. 

In the course of this chapter we shall illustrate some of the principles 
and give examples of a few very simple designs. But first we may 
observe two important principles of design which are largely matters
of common sense and experience. The first is: every possible outcome
of the experiment must be anticipated and a conclusion decided upon
for each possible outcome in advance of performing the experiment.
For example, suppose a man claims he can read his wife’s mind to the 
extent that he can very often tell whether she is looking at a red or 
black playing card. To test this contention, the following experiment 
is to be performed: His wife is to look at cards drawn one by one 
from an ordinary deck, and the man is to say in each instance whether 
it is red or black. If the whole deck is to be used, there are 53 possible 
outcomes; he may call 0, 1, 2, ..., 52 of the cards correctly. And
let us suppose it is agreed to accept his claim if 40 or more are called 
correctly and to reject the claim if 39 or less are called correctly. This 
simple experiment is now completely designed in the sense that the 
conclusion is only a matter of performing the experiment, observing 
the number correctly called, and adopting the appropriate conclusion. 
If it turned out, for example, that 30 cards were called correctly, 
among them 12 of the spades, the man might argue that he had 
demonstrated his ability because the probability of calling 12 spades 
correctly under the assumption of random calling is so very small as 
to make that assumption absurd. This argument is not valid because 
any set of 30 cards can be found to have some peculiarity which would 
make it highly improbable under random sampling. (In particular,
of course, the probability of drawing any specified set of 30 cards is
$1/\binom{52}{30} \approx 4 \times 10^{-15}$ for random selection of 30 cards without
replacement.) Any inference from experimental data cannot be supported
by a fiducial probability statement unless that inference was taken
account of in advance of the performance of the experiment. Any
seemingly significant but unforeseen inference can only suggest a 
new experiment. It follows, of course, that an experimenter who does 
not anticipate any inferences at all but merely waits to see what will 
turn up in the data, cannot support any conclusion whatever by a 
fiducial probability statement. 

The second broad principle we wish to mention specifically is this: 
there must be an element of randomization in the experiment. An 
experiment is performed to test a hypothesis, or to estimate a param- 
eter or a set of parameters. The hypothesis adopted is supported by 
odds based on a computation which assumes random sampling under a 
null hypothesis. The parameter is estimated by a confidence interval
with a fiducial probability determined by the assumption of randomness.
It is quite evident that the results of an experiment cannot be
supported by probability statements unless the sampling was in fact 
random. Referring to the card-calling experiment described above, 



the null hypothesis is that the man has not any ability to call the cards 
correctly. The probability of calling 40 or more cards correctly is 
roughly .0001 under the assumption of random calling, and the null 
hypothesis would be emphatically rejected if 40 or more were called 
correctly, provided random sampling is operative under the null 
hypothesis. The proper condition obtains if the cards are presented 
in a random order (by thoroughly shuffling the deck, for example), for 
then the result of the experiment will have a random sampling distri- 
bution under any system of calling which is independent of the actual 
sequence of colors of the cards. (It is tacitly assumed here that red 
and black will be called in about equal numbers, that one will not call 
all 52 cards black, for example.) One could, of course, present the 
cards in some order particularly devised perhaps to confuse the caller, 
and the caller might nevertheless be quite successful and establish his 
ability beyond reasonable doubt, but one could not measure his success 
in probability terms. Statistical inference is impossible in nonrandom- 
ized experiments. 
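
The two probabilities quoted for the card-calling experiment can be checked with a few lines of Python. This is a rough sketch: the tail probability is computed under the simplifying assumption, made here only for illustration, that each of the 52 calls is independently correct with probability one-half. The variable names are hypothetical.

```python
from math import comb

# Probability of drawing one specified set of 30 cards from 52,
# for random selection without replacement.
p_specified_set = 1 / comb(52, 30)

# Probability of 40 or more correct calls out of 52 when every call is an
# independent fair guess (binomial model, assumed for this sketch).
p_40_or_more = sum(comb(52, c) for c in range(40, 53)) / 2 ** 52
```

Under this model the second figure comes out somewhat below .0001, in line with the "roughly .0001" quoted above. The first is small enough that any particular set of 30 cards would look remarkable after the fact.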

It has been found in practice that persons cannot be relied upon to 
write down random sets of numbers at will. Randomization in experi- 
mental design must be carried out by actually tossing coins, casting 
dice, drawing numbered chips from a bowl, or the like. Specially 
prepared tables of random numbers have been published to save 
experimenters the trouble of performing these operations. 

14.2. Analysis of Variance in Regression. The analysis of variance 
is a technique for testing linear hypotheses, and basically it is just the 
technique described in the preceding chapter. All we shall do in this 
chapter is study that technique in more detail and investigate simpli- 
fications that can be made in applying the technique to certain special 
problems that arise frequently in practice. The point of view, how- 
ever, will be somewhat different, and to illustrate it, we return to the 
simple linear regression problem. 

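Let us suppose that a variate x is normally distributed about a regression
function $\alpha + \beta z$ with variance $\sigma^2$. A sample of size n is observed:
$(x_1, z_1), (x_2, z_2), \dots, (x_n, z_n)$. Let $\hat{\alpha}$ and $\hat{\beta}$ be defined by equations
(13.2.8) and (13.2.9). The sum of squares of deviations from the true
regression will be divided into two parts as follows:

$$\begin{aligned}
\sum (x_i - \alpha - \beta z_i)^2 &= \sum (x_i - \hat{\alpha} - \hat{\beta} z_i + \hat{\alpha} + \hat{\beta} z_i - \alpha - \beta z_i)^2 \\
&= \sum (x_i - \hat{\alpha} - \hat{\beta} z_i)^2 \\
&\quad + 2\sum (x_i - \hat{\alpha} - \hat{\beta} z_i)(\hat{\alpha} + \hat{\beta} z_i - \alpha - \beta z_i) \\
&\quad + \sum (\hat{\alpha} + \hat{\beta} z_i - \alpha - \beta z_i)^2
\end{aligned} \tag{1}$$
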
The middle sum on the right of (1) vanishes identically, as may be
seen by performing the summation and using the definitions of $\hat{\alpha}$ and $\hat{\beta}$.
The first sum on the right of (1) is the sum of squares of deviations
from the estimated regression function; it is just $n\hat{\sigma}^2$ where $\hat{\sigma}^2$ is the
maximum-likelihood estimate of $\sigma^2$ defined in Sec. 13.2. The third
sum on the right of (1) is, apart from a divisor $\sigma^2$, the quadratic
form (13.2.30) in the distribution of $\hat{\alpha}$ and $\hat{\beta}$. The total sum of squares
on the left of (1), on division by $\sigma^2$, has the chi-square distribution with
n degrees of freedom; it has been partitioned into two parts which are
independently distributed by chi-square distributions, one with n - 2
degrees of freedom and the other with two degrees of freedom.

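The third sum on the right of (1) may be further partitioned into
two parts which are independently distributed by chi-square
laws with one degree of freedom each. It is apparent from (13.2.30) that
$\hat{\alpha}$ and $\hat{\beta}$ are not independently distributed except in the special case in
which $\bar{z} = 0$. However, $\bar{x}$ and $\hat{\beta}$ are independently distributed, as
may be seen by changing the variable $\hat{\alpha}$ to $\bar{x}$ using the substitution

$$\hat{\alpha} = \bar{x} - \hat{\beta}\bar{z} \tag{2}$$

in the joint distribution of $\hat{\alpha}$ and $\hat{\beta}$. In fact, $\bar{x}$ and $\hat{\beta}$ are independently
normally distributed. In terms of these variables, the third sum of
(1) is

$$\begin{aligned}
\sum (\hat{\alpha} + \hat{\beta} z_i - \alpha - \beta z_i)^2 &= \sum (\bar{x} - \hat{\beta}\bar{z} + \hat{\beta} z_i - \alpha - \beta z_i)^2 \\
&= \sum \bigl[(\bar{x} - \alpha - \beta\bar{z}) + (\hat{\beta} - \beta)(z_i - \bar{z})\bigr]^2
\end{aligned}$$

$$= \sum (\bar{x} - \alpha - \beta\bar{z})^2 + \sum \bigl[(\hat{\beta} - \beta)(z_i - \bar{z})\bigr]^2 \tag{3}$$

$$= n(\bar{x} - \alpha - \beta\bar{z})^2 + (\hat{\beta} - \beta)^2 \sum (z_i - \bar{z})^2 \tag{4}$$

The sum of cross products has been omitted in (3) because it is readily
seen to vanish since $\sum (z_i - \bar{z}) = 0$. The two terms on the right of (4),
apart from a factor $-2\sigma^2$, are just the exponents in the univariate
normal distributions of $\bar{x}$ and $\hat{\beta}$; hence they are independently distributed
by chi-square laws with one degree of freedom.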

The total sum of squares of deviations has now been partitioned into
three parts:

$$\sum (x_i - \alpha - \beta z_i)^2 = \sum (x_i - \hat{\alpha} - \hat{\beta} z_i)^2 + (\hat{\beta} - \beta)^2 \sum (z_i - \bar{z})^2 + n(\bar{x} - \alpha - \beta\bar{z})^2 \tag{5}$$

which are independently distributed by chi-square laws. We
turn now to the question of testing whether $\alpha$ and $\beta$ differ from zero.
If, in particular, $\alpha$ and $\beta$ are put equal to zero throughout (5), we have

$$\sum x_i^2 = \sum (x_i - \hat{\alpha} - \hat{\beta} z_i)^2 + \hat{\beta}^2 \sum (z_i - \bar{z})^2 + n\bar{x}^2 \tag{6}$$


All these terms are directly calculable from the data, and in the analysis 
of variance, this partition of the sum of squares is usually exhibited 
in a table such as the one given here. In a particular problem, the 
entries in the table would all be numerical. 


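ANALYSIS OF VARIANCE FOR SIMPLE LINEAR REGRESSION

| Source | Sum of squares | Degrees of freedom | Mean square | F ratio |
|---|---|---|---|---|
| Mean | $n\bar{x}^2$ | 1 | $n\bar{x}^2$ | $\dfrac{n\bar{x}^2}{\sum (x_i - \hat{\alpha} - \hat{\beta} z_i)^2/(n - 2)}$ |
| Slope | $\hat{\beta}^2 \sum (z_i - \bar{z})^2$ | 1 | $\hat{\beta}^2 \sum (z_i - \bar{z})^2$ | $\dfrac{\hat{\beta}^2 \sum (z_i - \bar{z})^2}{\sum (x_i - \hat{\alpha} - \hat{\beta} z_i)^2/(n - 2)}$ |
| Deviations | $\sum (x_i - \hat{\alpha} - \hat{\beta} z_i)^2$ | n - 2 | $\dfrac{1}{n - 2}\sum (x_i - \hat{\alpha} - \hat{\beta} z_i)^2$ | |
| Total | $\sum x_i^2$ | n | | |
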

Now let us consider the null hypothesis that $\beta = 0$. If it is true,
then the sums of squares in the second and third lines of the table are
independently distributed by chi-square laws with 1 and n - 2 degrees
of freedom (on division by $\sigma^2$), and the ratio of the mean squares will
have the F distribution with 1 and n - 2 degrees of freedom. This is
exactly the test given by (13.2.32) because the square of a t variate
with k degrees of freedom has the F distribution with 1 and k degrees of
freedom (Sec. 10.6). The sum-of-squares entry in the second line
of the table is said to be the portion of the total sum of squares $\sum x_i^2$
associated with $\beta$.
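
The entries of such a table are easily computed. The sketch below is illustrative only: the function name `regression_anova` and the data are hypothetical, and the usual least-squares formulas for the estimated intercept and slope are assumed. It evaluates the three sums of squares, confirms that they add up to the total sum of squares, and forms the two F ratios.

```python
def regression_anova(x, z):
    """Sums of squares and F ratios for the simple linear regression of x on z."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    b = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / szz
    a = x_bar - b * z_bar
    ss_mean = n * x_bar ** 2                      # first line of the table
    ss_slope = b ** 2 * szz                       # second line
    ss_dev = sum((xi - a - b * zi) ** 2 for xi, zi in zip(x, z))   # third line
    ms_dev = ss_dev / (n - 2)
    return {
        "ss_mean": ss_mean, "ss_slope": ss_slope, "ss_dev": ss_dev,
        "ss_total": sum(xi ** 2 for xi in x),
        "F_mean": ss_mean / ms_dev, "F_slope": ss_slope / ms_dev,
        "df_dev": n - 2,
    }


# Hypothetical data, not taken from the text.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.1, 3.9, 6.2, 7.8, 10.1]
table = regression_anova(x, z)
```

The value `table["F_slope"]` would be referred to the F distribution with 1 and n - 2 degrees of freedom to test whether the slope is zero.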

Now let us turn to the first line of the table. The F ratio in the first
line provides a test for the null hypothesis, $\alpha = 0$, only if it is assumed
that $\beta = 0$ (unless $\bar{z}$ happens to be zero). Thus the two F tests indicated
in the right-hand column of the table are of two different kinds.
The second one tests

$\beta = 0$, whatever $\alpha$ may be

the first one tests

$\alpha = 0$, provided $\beta$ is actually zero

These statements are evident on comparing (5) and (6). The first
term on the right of (6) has the chi-square distribution whatever $\alpha$ and
$\beta$ may be; the second term has the chi-square distribution whatever
$\alpha$ may be provided only that $\beta = 0$; the third term has the chi-square
distribution only if $\alpha + \beta\bar{z} = 0$.

The two tests on $\alpha$ and $\beta$ are said to be nonorthogonal. If it had been
possible to partition the two degrees of freedom for $\alpha$ and $\beta$ into two
single degrees of freedom, one involving $\alpha$ only and one involving $\beta$
only in such a way that they were independently distributed, then we
should have had orthogonal tests of $\alpha$ and $\beta$ and could test $\alpha = 0$
whatever $\beta$ might be.

If in collecting the data, the values of z are chosen so that $\bar{z} = 0$, then
orthogonal tests of $\alpha$ and $\beta$ are available. For then $\hat{\alpha}$ becomes equal
to $\bar{x}$, and in fact the F test indicated in the first line of the table becomes
equivalent to the t test given by equation (13.2.31). It is to be
recalled, of course, that we can test $\alpha = 0$ without assuming $\beta = 0$
by using that t test.
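
A small numerical illustration of this remark, with hypothetical data and a helper name `fit_line` introduced only for the sketch: when the z values are measured from their own mean, the fitted intercept coincides with the mean of the x observations, while the slope is unaffected.

```python
def fit_line(x, z):
    """Least-squares intercept and slope for the regression of x on z."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    b = (sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
         / sum((zi - z_bar) ** 2 for zi in z))
    return x_bar - b * z_bar, b


# Hypothetical data, not taken from the text.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.1, 3.9, 6.2, 7.8, 10.1]
x_bar = sum(x) / len(x)
z_bar = sum(z) / len(z)

a_raw, b_raw = fit_line(x, z)                         # z as recorded
a_cen, b_cen = fit_line(x, [zi - z_bar for zi in z])  # z chosen so its mean is zero
```

Here `a_cen` equals `x_bar`, whereas `a_raw` does not; `b_raw` and `b_cen` agree.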

The condition of orthogonality is regarded as desirable because it
provides a partial measure of statistical independence in tests. Suppose
$\bar{z} = 0$; then the two tests of $\alpha = 0$ and $\beta = 0$ are still not statistically
independent because the two F ratios have the same denominator.
If one worked out the joint distribution of the two ratios, he
would find that they are not independently distributed. But the fact
that the two numerators of the F ratios are independently distributed
has some intuitive appeal. It is usually impossible to design experiments
so as to get completely independent tests, but it is often possible
to design them so as to get orthogonal tests. Thus in the present
example, one can investigate a regression function $\alpha + \beta z$ by means
of the two t tests described in Sec. 13.2, and these tests are nonorthogonal
in general; it may be possible, however, to select z values so that
$\bar{z} = 0$ and thus obtain orthogonal tests.

From the practical point of view, orthogonality is quite desirable 
because the analysis of data is usually very much simpler for orthog- 
onal than for nonorthogonal designs. 

Test of Linearity. Before leaving the linear regression problem we
shall consider one other test which is quite useful when the data are
such that it is feasible. Suppose that for one or more of the z values
there are two or more x observations. More precisely, let there be k
distinct values of z ($z_1, z_2, \dots, z_k$) and let the x observations be
denoted by $x_{st}$, where $s = 1, 2, \dots, k$ and $t = 1, 2, \dots, n_s$. Corresponding
to $z_s$, there are thus $n_s$ x observations, and we assume that
not all the $n_s$ are one. Letting $n = \sum n_s$, we may relabel the $x_{st}$, calling
them $x_1, x_2, \dots, x_n$, and perform the analysis already described.

The deviations from the fitted regression may be written

$$\sum_i (x_i - \hat{\alpha} - \hat{\beta} z_i)^2 = \sum_{s,t} (x_{st} - \hat{\alpha} - \hat{\beta} z_s)^2 \tag{7}$$

in the $x_{st}$ notation; the $z_s$ are all distinct, while the $z_i$ are not, with the
data under present consideration.

The right-hand side of (7) will now be partitioned into two parts as
follows:

$$\begin{aligned}
\sum_{s,t} (x_{st} - \hat{\alpha} - \hat{\beta} z_s)^2 &= \sum_{s,t} (x_{st} - \bar{x}_s + \bar{x}_s - \hat{\alpha} - \hat{\beta} z_s)^2 \\
&= \sum_{s,t} (x_{st} - \bar{x}_s)^2 + \sum_{s,t} (\bar{x}_s - \hat{\alpha} - \hat{\beta} z_s)^2 \\
&= \sum_{s,t} (x_{st} - \bar{x}_s)^2 + \sum_s n_s (\bar{x}_s - \hat{\alpha} - \hat{\beta} z_s)^2
\end{aligned} \tag{8}$$

where $\bar{x}_s = \sum_t x_{st}/n_s$. The first sum on the right has the chi-square
distribution with $\sum_s (n_s - 1) = n - k$ degrees of freedom, whatever
the regression function may be. For, for fixed z, x is normally distributed
with variance $\sigma^2$, and the sample $x_{11}, x_{12}, \dots, x_{1n_1}$ of $n_1$
observations for $z = z_1$ provides a sum of squares $\sum_t (x_{1t} - \bar{x}_1)^2$, which
on division by $\sigma^2$ has the chi-square distribution with $n_1 - 1$ degrees
of freedom. The first sum on the right of (8) is simply the sum of all
such chi-squares for the various values of z. The second sum of
squares on the right of (8) has the chi-square distribution (with k - 2
degrees of freedom) only if the regression function is in fact of the form
$\alpha + \beta z$. Thus

$$F = \frac{\sum_s n_s (\bar{x}_s - \hat{\alpha} - \hat{\beta} z_s)^2/(k - 2)}{\sum_{s,t} (x_{st} - \bar{x}_s)^2/(n - k)}$$

provides a test for the hypothesis that the regression function is of the
form $\alpha + \beta z$, and the critical region is the right-hand tail of the F
distribution since a regression function different from $\alpha + \beta z$ would
tend to increase the deviations of $\bar{x}_s$ from $\hat{\alpha} + \hat{\beta} z_s$.

Though this device is called a test for linearity, the same technique
could obviously be used to test the validity of any specified regression 
function provided the function was linear in the unknown coefficients 
and there were fewer coefficients than distinct values of z. 
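
The test of linearity can be sketched as follows. The function name `linearity_test`, the layout of the data as a mapping from each distinct z value to its x observations, and the numbers themselves are hypothetical. The code fits the line to all n observations, splits the deviations from the fitted line into a within-group part and a part measuring departure of the group means from the line, and forms the F ratio with k - 2 and n - k degrees of freedom.

```python
def linearity_test(groups):
    """groups maps each distinct z value to the list of x observations taken there."""
    zs = sorted(groups)
    k = len(zs)
    n = sum(len(groups[z]) for z in zs)
    # Least-squares line fitted to all n observations.
    x_bar = sum(sum(groups[z]) for z in zs) / n
    z_bar = sum(z * len(groups[z]) for z in zs) / n
    szz = sum(len(groups[z]) * (z - z_bar) ** 2 for z in zs)
    sxz = sum((x - x_bar) * (z - z_bar) for z in zs for x in groups[z])
    b = sxz / szz
    a = x_bar - b * z_bar
    means = {z: sum(groups[z]) / len(groups[z]) for z in zs}
    ss_within = sum((x - means[z]) ** 2 for z in zs for x in groups[z])
    ss_lack = sum(len(groups[z]) * (means[z] - a - b * z) ** 2 for z in zs)
    ss_resid = sum((x - a - b * z) ** 2 for z in zs for x in groups[z])
    f_ratio = (ss_lack / (k - 2)) / (ss_within / (n - k))
    return {"F": f_ratio, "df": (k - 2, n - k),
            "ss_within": ss_within, "ss_lack": ss_lack, "ss_resid": ss_resid}


# Hypothetical data: two x observations at each of four z values.
data = {1.0: [1.1, 0.9], 2.0: [2.0, 2.2], 3.0: [2.9, 3.1], 4.0: [4.2, 3.8]}
result = linearity_test(data)
```

A large value of `result["F"]`, judged against the right-hand tail of the F distribution with the degrees of freedom in `result["df"]`, would cast doubt on linearity. For these made-up data the ratio is small.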

14.3. One-factor Experiments. As an illustration, let us suppose 
that a factory manager wishes to buy machines to perform a certain 
operation in a production process. There are four companies which 
make such machines, and he obtains one on trial from each company 
with a view to determining which of the four is best suited to his 
purposes. Suppose also that a machine is operated by one man. 
The manager intends to have several of his men operate the machines 
for a few days in order to discover which of the four produces the most 
items per day. In this simple experiment the subject is the number of 
items produced, and the single factor is type of machine. 

Let us suppose that twenty men are to be used in the experiment, 
five being assigned at random to each machine, and that each man will 
work one day on the particular machine he was assigned to. There 
will then be five observations for each of the four machines, each 
observation being the amount produced by the machine in one day. 
The data might be such as appear in the accompanying table. The 
question of interest is whether or not the machines are different with 
respect to number of items produced; i.e., is the subject of the experi- 
ment affected by the factor being investigated? 
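
The random assignment of men to machines might be sketched as below. A pseudo-random generator stands in for the physical devices or tables of random numbers; that substitution, the function name `assign_at_random`, and the seed value are assumptions of this illustration.

```python
import random


def assign_at_random(units, treatments, per_treatment, seed=None):
    """Shuffle the units and deal them out in equal groups, one group per treatment."""
    if len(units) != len(treatments) * per_treatment:
        raise ValueError("number of units must equal treatments x per_treatment")
    rng = random.Random(seed)          # seeded only to make the sketch repeatable
    order = list(units)
    rng.shuffle(order)
    return {t: order[i * per_treatment:(i + 1) * per_treatment]
            for i, t in enumerate(treatments)}


men = list(range(1, 21))               # twenty men, labeled 1 to 20
machines = [1, 2, 3, 4]
assignment = assign_at_random(men, machines, per_treatment=5, seed=1950)
```

Each machine receives five men, and every man appears exactly once.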


Machine number 


In order to analyze these data, the following assumptions will be
made: the five observations for machine 1 constitute a random sample
from a normal population with mean $\xi_1$ and variance $\sigma^2$; the observations
for the second machine are an independent random sample from
a normal population with mean $\xi_2$ and the same variance $\sigma^2$; and similarly
for the other two machines. The assumptions are thus:

1. The samples are random. 


2. The samples are independent. 


3. The populations are normal. 

4. The populations all have the same variance (often called the 
assumption of homoscedasticity). 

In the general one-factor experiment, the factor will appear at k
levels (instead of four); the observations will be denoted by $x_{ij}$, with
$i = 1, 2, \dots, k$, and $j = 1, 2, \dots, n_i$, allowing for the possibility
that there may be different numbers of observations at each level.
The joint density of the $x_{ij}$ is the product of the individual densities:

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \xi_i)^2\right] \tag{1}$$

where n represents $\sum n_i$. The null hypothesis to be tested is that
$\xi_1 = \xi_2 = \xi_3 = \cdots = \xi_k$. One could obtain a test by the likelihood-ratio
method, i.e., by maximizing (1) with respect to all the
parameters, then with all $\xi_i$ made equal, and using the ratio as a test
criterion. We shall, however, proceed differently.

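The average of all the population means will be denoted by $\bar{\xi}$,

$$\bar{\xi} = \frac{1}{n}\sum_{ij} \xi_i = \frac{1}{n}\sum_i n_i \xi_i \tag{2}$$

and the deviations of the $\xi_i$ from $\bar{\xi}$ will be denoted by

$$\alpha_i = \xi_i - \bar{\xi}, \qquad \sum n_i \alpha_i = 0 \tag{3}$$

The $\alpha_i$ are called the effects of the factor; the effects are zero under the
null hypothesis. Also we shall denote the cell means by

$$\bar{x}_i = \frac{1}{n_i}\sum_j x_{ij} \tag{4}$$

$$\bar{x} = \frac{1}{n}\sum_{ij} x_{ij} = \frac{1}{n}\sum_i n_i \bar{x}_i \tag{5}$$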


The sum of squares of deviations from the population mean for the
observations in any one cell may be partitioned as follows:

$$\begin{aligned}
\sum_j (x_{ij} - \xi_i)^2 &= \sum_j (x_{ij} - \bar{x}_i + \bar{x}_i - \xi_i)^2 \\
&= \sum_j (x_{ij} - \bar{x}_i)^2 + n_i(\bar{x}_i - \xi_i)^2
\end{aligned} \tag{6}$$

and the two terms on the right of (6) (on division by $\sigma^2$) have independent
chi-square distributions with $n_i - 1$ and 1 degrees of freedom,
as follows from Sec. 10.4. On summing (6) over i, the total sum of
squares is partitioned into two parts:

$$\sum_{ij} (x_{ij} - \xi_i)^2 = \sum_{ij} (x_{ij} - \bar{x}_i)^2 + \sum_i n_i(\bar{x}_i - \xi_i)^2 \tag{7}$$

independently distributed by chi-square laws with n - k and k
degrees of freedom. The second term on the right of (7) may be
further partitioned:

$$\begin{aligned}
\sum_i n_i(\bar{x}_i - \xi_i)^2 &= \sum_i n_i(\bar{x}_i - \bar{x} - \alpha_i + \bar{x} - \bar{\xi})^2 \\
&= \sum_i n_i(\bar{x}_i - \bar{x} - \alpha_i)^2 + n(\bar{x} - \bar{\xi})^2
\end{aligned} \tag{8}$$

The two terms on the right of (8) are independently distributed by
chi-square laws with k - 1 and 1 degrees of freedom, as may be shown
by an argument entirely analogous to that employed in Sec. 10.4.
We have then

$$\sum_{ij} (x_{ij} - \xi_i)^2 = \sum_{ij} (x_{ij} - \bar{x}_i)^2 + \sum_i n_i(\bar{x}_i - \bar{x} - \alpha_i)^2 + n(\bar{x} - \bar{\xi})^2 \tag{9}$$

and this partition is usually exhibited in an analysis-of-variance table
such as the accompanying one with the parameters put equal to zero.


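ANALYSIS OF VARIANCE FOR ONE-FACTOR EXPERIMENTS

| Source | Sum of squares | Degrees of freedom | Mean square | F ratio |
|---|---|---|---|---|
| Mean | $n\bar{x}^2$ | 1 | $n\bar{x}^2$ | |
| Effects | $\sum_i n_i(\bar{x}_i - \bar{x})^2$ | k - 1 | $\dfrac{\sum_i n_i(\bar{x}_i - \bar{x})^2}{k - 1}$ | $\dfrac{\sum_i n_i(\bar{x}_i - \bar{x})^2/(k - 1)}{\sum_{ij} (x_{ij} - \bar{x}_i)^2/(n - k)}$ |
| Deviations | $\sum_{ij} (x_{ij} - \bar{x}_i)^2$ | n - k | $\dfrac{\sum_{ij} (x_{ij} - \bar{x}_i)^2}{n - k}$ | |
| Total | $\sum_{ij} x_{ij}^2$ | n | | |
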


The ratio in the right-hand column obviously has the F distribution
with k - 1 and n - k degrees of freedom under the null hypothesis,
$\alpha_i = 0$, and thus provides a test criterion for that hypothesis. Ordinarily
there is no interest in testing $\bar{\xi} = 0$, but if there is, the quantity
$n\bar{x}^2$ divided by the mean-square deviations will have the F distribution
with 1 and n - k degrees of freedom under that null hypothesis. This
latter test, incidentally, is orthogonal to the test on $\alpha_i$ for the last term
on the right of (9) does not involve the $\alpha_i$.
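
A minimal sketch of the computation for a one-factor experiment follows. The counts below are hypothetical stand-ins for a four-machine, five-men-per-machine experiment and are not the book's data; the function name `one_factor_anova` is likewise introduced only for illustration.

```python
def one_factor_anova(samples):
    """samples is a list of k lists; samples[i] holds the observations at level i."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    cell_means = [sum(s) / len(s) for s in samples]
    ss_mean = n * grand_mean ** 2
    ss_effects = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, cell_means))
    ss_dev = sum((v - m) ** 2 for s, m in zip(samples, cell_means) for v in s)
    ss_total = sum(v ** 2 for s in samples for v in s)
    f_ratio = (ss_effects / (k - 1)) / (ss_dev / (n - k))
    return {"ss_mean": ss_mean, "ss_effects": ss_effects, "ss_dev": ss_dev,
            "ss_total": ss_total, "F": f_ratio, "df": (k - 1, n - k)}


# Hypothetical daily production figures for four machines, five men each.
machines = [
    [10, 12, 11, 9, 13],
    [14, 15, 13, 16, 12],
    [11, 10, 12, 13, 9],
    [16, 17, 15, 18, 14],
]
anova = one_factor_anova(machines)
```

The three sums of squares add up to the total, and `anova["F"]` would be compared with the F distribution having the degrees of freedom in `anova["df"]`.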

14.4. An Application of Normal Regression Theory. The foregoing 
analysis of the one-factor experiment is somewhat artificial in that the 
partition of the sum of squares seems to have no particular motivation. 
How would one know to embark on such an analysis in the first place? 
Having developed a logical theory of testing linear hypotheses in the 
preceding chapter, why not apply it here? The answer is that the 
foregoing analysis is relatively simple, whereas the application of the 
general theory involves some troublesome algebraic manipulation. 
As experiments become more complicated, the algebra of the general 
method becomes quite complex, involving, as it does, the inversion of 
large matrices. With experience, one can develop a facility for par- 
titioning the sum of squares appropriately and thus save himself a 
great deal of mathematical analysis. 

The simple partitioning of the sum of squares happens to give the 
correct tests when tests are orthogonal, but it does not prove, without 
advanced mathematical arguments unavailable to us here, that the 
tests are correct. A rigorous derivation of the tests does require
application of the general theory, and we shall illustrate such an 
application for the one-factor experiment. 

The k normal populations of Sec. 3 may be combined into a normal
regression system with mean

$$\mu = \sum_{i=1}^{k} \delta_i \xi_i \tag{1}$$

where $\delta_i$ is an observable parameter defined to be one when an observation
is drawn from the ith population and zero otherwise. The means
$\xi_i$ thus become coefficients of a linear regression function. It is
simpler, however, to set up the regression function in terms of the
$\alpha$'s so that the null hypothesis is in the form $\alpha_i = 0$ rather than
$\xi_1 = \xi_2 = \cdots = \xi_k$, i.e., in the form of (13.6.4) rather than (13.6.8).
To this end, we write (1) as

$$\mu = \bar{\xi} + \sum_i \delta_i \alpha_i \tag{2}$$


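but now we have one too many parameters because the $\alpha_i$ are connected
by $\sum n_i \alpha_i = 0$. We shall eliminate $\alpha_k$ from (2) by the substitution

$$\alpha_k = -\frac{1}{n_k}\sum_{i=1}^{k-1} n_i \alpha_i \tag{3}$$

and get

$$\begin{aligned}
\mu &= \bar{\xi} + \sum_{i=1}^{k-1} \delta_i \alpha_i - \frac{\delta_k}{n_k}\sum_{i=1}^{k-1} n_i \alpha_i \\
&= \bar{\xi} + \sum_{i=1}^{k-1} \alpha_i\left(\delta_i - \frac{n_i}{n_k}\,\delta_k\right)
\end{aligned} \tag{4}$$

Now we define new observable parameters $z_p$ by

$$z_p = \delta_p - \frac{n_p}{n_k}\,\delta_k, \qquad p = 1, 2, \dots, k - 1 \tag{5}$$

$$z_p = 1, \qquad p = k \tag{6}$$

or, for $p = 1, 2, \dots, k - 1$,

$$z_p = \begin{cases} 1 & \text{if } x_{ij} \text{ has } i = p \\ -\dfrac{n_p}{n_k} & \text{if } x_{ij} \text{ has } i = k \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

The regression function is now

$$\mu = \sum_{p=1}^{k-1} \alpha_p z_p + \bar{\xi} z_k \tag{8}$$

and is of the form discussed in Sec. 13.5, where $\bar{\xi}$ is to be identified with
the $\alpha_k$ of that section.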
Since, obviously, 


£-5 (9) 
we have at once the estimators 
à -Zz-—i Pauls os, Tees (10) 
== (11) 
= Ly (rj — 2)? (12) 


The test of the null hypothesis, a; = 0, is given by (13.6.6) so that we 
must only evaluate Q, which is defined by (13.6.5). To this end, we
must examine the matrix $\|a_{pq}\|$ defined by

$$a_{pq} = \sum_{i,j} z_{pij}\, z_{qij} \tag{13}$$

since the sum on i in Sec. 13.5 refers to the sum over all sample observations
and becomes the total sum over i and j in the present example.
$z_{pij}$ is, of course, the value of $z_p$ for the observation $x_{ij}$. It follows
readily from equation (7) that the matrix is

$$\|a_{pq}\| = \begin{pmatrix}
n_1 + \dfrac{n_1^2}{n_k} & \dfrac{n_1 n_2}{n_k} & \cdots & \dfrac{n_1 n_{k-1}}{n_k} & 0 \\
\dfrac{n_1 n_2}{n_k} & n_2 + \dfrac{n_2^2}{n_k} & \cdots & \dfrac{n_2 n_{k-1}}{n_k} & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\dfrac{n_1 n_{k-1}}{n_k} & \dfrac{n_2 n_{k-1}}{n_k} & \cdots & n_{k-1} + \dfrac{n_{k-1}^2}{n_k} & 0 \\
0 & 0 & \cdots & 0 & n
\end{pmatrix} \tag{14}$$


To obtain the coefficients $b_{uv}$ which appear in Q, one would ordinarily
invert (14), then strike out the last row and column (m being k − 1 in
the present example), then invert the result. This work is not necessary
in the instance at hand, for Q is the quadratic form of the marginal
distribution of the $\hat{\alpha}_u$ (u = 1, 2, ..., k − 1), and it is apparent from
the form of (14) that the $\hat{\alpha}_u$ and $\hat{\xi}$ are independently distributed. That
is, because of the zeros in $\|a_{pq}\|$, the joint distribution of the $\hat{\alpha}_u$ and $\hat{\xi}$
may be written as the product of a function of the $\hat{\alpha}_u$ alone and a
function of $\hat{\xi}$ alone. It is evident then that

$$b_{uv} = a_{uv}, \qquad u, v = 1, 2, \ldots, k-1 \tag{15}$$

hence that

$$Q = \sum_{u=1}^{k-1} n_u (\hat{\alpha}_u - \alpha_u)^2
   + \frac{1}{n_k} \sum_{u,v=1}^{k-1} n_u n_v (\hat{\alpha}_u - \alpha_u)(\hat{\alpha}_v - \alpha_v) \tag{16}$$

In this expression we put the $\alpha$'s equal to zero, and we may substitute
from (10) for the $\hat{\alpha}$'s to obtain

$$Q = \sum_{u=1}^{k-1} n_u (\bar{x}_{u\cdot} - \bar{x})^2
   + \frac{1}{n_k} \sum_{u,v=1}^{k-1} n_u n_v (\bar{x}_{u\cdot} - \bar{x})(\bar{x}_{v\cdot} - \bar{x}) \tag{17}$$


The second term is simply 


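$$\frac{1}{n_k} \left[ \sum_{u=1}^{k-1} n_u (\bar{x}_{u\cdot} - \bar{x}) \right]^2
 = \frac{1}{n_k} \left[ n\bar{x} - n_k \bar{x}_{k\cdot} - (n - n_k)\bar{x} \right]^2
 = n_k (\bar{x}_{k\cdot} - \bar{x})^2$$

Thus Q becomes

$$Q = \sum_{i=1}^{k} n_i (\bar{x}_{i\cdot} - \bar{x})^2 \tag{18}$$

and the F ratio (13.6.6) is

$$F = \frac{\sum_i n_i (\bar{x}_{i\cdot} - \bar{x})^2 / (k-1)}{\sum_{i,j} (x_{ij} - \bar{x}_{i\cdot})^2 / (n-k)} \tag{19}$$

the same as appears in the analysis-of-variance table of the preceding
section.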

We have shown, incidentally, in this section that the two terms of 
equation (3.8) are independently distributed by chi-square laws. 
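
The reduction of Q from the regression form (16) to the between-groups form (18) can be checked numerically. The following is a minimal sketch in Python and is not part of the original text: the function names and the sample data are made up for illustration, and exact rational arithmetic is used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction


def z_value(p, i, sizes):
    """z_p of equations (5)-(7) for an observation from population i (0-based indices)."""
    k = len(sizes)
    if p == k - 1:          # z_k = 1 for every observation, equation (6)
        return Fraction(1)
    if i == p:
        return Fraction(1)
    if i == k - 1:
        return Fraction(-sizes[p], sizes[k - 1])
    return Fraction(0)


def a_matrix(sizes):
    """a_pq of equation (13); the n_i observations of population i share the same z values."""
    k = len(sizes)
    return [
        [sum(sizes[i] * z_value(p, i, sizes) * z_value(q, i, sizes) for i in range(k))
         for q in range(k)]
        for p in range(k)
    ]


def q_from_regression(samples):
    """Q of equation (16) with the alphas set to zero and b_uv = a_uv."""
    sizes = [len(s) for s in samples]
    k, n = len(samples), sum(sizes)
    grand = Fraction(sum(map(sum, samples)), n)
    alpha_hat = [Fraction(sum(s), len(s)) - grand for s in samples[:k - 1]]  # equation (10)
    a = a_matrix(sizes)
    return sum(a[u][v] * alpha_hat[u] * alpha_hat[v]
               for u in range(k - 1) for v in range(k - 1))


def q_between_groups(samples):
    """Q of equation (18)."""
    n = sum(len(s) for s in samples)
    grand = Fraction(sum(map(sum, samples)), n)
    return sum(len(s) * (Fraction(sum(s), len(s)) - grand) ** 2 for s in samples)


# Hypothetical data: three populations with unequal sample sizes.
samples = [[3, 5, 4], [6, 8], [1, 2, 2, 3]]
a = a_matrix([len(s) for s in samples])
print(a[0][2], a[2][2])   # border entry and corner entry of the matrix
print(q_from_regression(samples) == q_between_groups(samples))
```

The border of zeros in the computed matrix is what allows the last row and column to be dropped without any matrix inversion.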

14.5. Two-factor Experiments with One Observation per Cell. It
may have been noticed that the experiment described in Sec. 3 was
very poorly designed. The trouble is that there is an extraneous
factor, ability of the various workmen, which must necessarily enter
into the experiment. If, in the experiment of Sec. 3, the production
from one machine turned out to be relatively large, was it due to the
machine, or to the excellence of the particular group of workmen
assigned to it? There is no way to tell from that experiment. In
the language of experimental design, the effects due to machines and
the effects due to groups of workmen are completely confounded; there
is no way to differentiate the two factors.

The difficulty is removed by redesigning the experiment as a two-factor
experiment. Let, for example, only five men be involved in the
experiment and let each of the five men work one day on each of the
four machines. The order in which a given man works on the four
machines would be assigned by a random process. The data are now
classified in a two-way table in accordance with the two factors and
might appear as in the accompanying table. When a two-factor
experiment is used to control an extraneous factor as in the case here,
the design is referred to as a randomized block design. The factor of
interest is compared in blocks (men, in the present instance) so that
conditions of the comparison are homogeneous within each block
though they differ from block to block.


                        Machine
 Man          1        2        3        4
  1          53       47       57       45
  2          56       50       63       52
  3     (illegible)   47       54       42
  4          52       47       57       41
  5          49       53       58       48


In general there will be, say, r rows and c columns for a two-factor
experiment, one factor A being examined at r levels, $A_1, A_2, \ldots, A_r$,
and the other factor B at c levels, $B_1, B_2, \ldots, B_c$. The observations
may be denoted by $x_{ij}$, i = 1, 2, ..., r, and j = 1, 2, ..., c. It is
assumed that the $x_{ij}$ are random independent observations from normal
populations with the same variances. It is further assumed that the
effects of the two factors are additive. This last assumption will be
discussed further in Secs. 6 and 9. Analytically it states that the
means of the normal populations associated with the individual cells
are assumed to be of the form

$$\xi_{ij} = \xi + \alpha_i + \beta_j \tag{1}$$

with

$$\sum_i \alpha_i = 0 \qquad \sum_j \beta_j = 0 \tag{2}$$

The parameter $\xi$ is the average of all the population means. In terms
of the illustrative example, the most skilled workman will have a
positive $\alpha$ associated with him, and the assumption (1) states that
whatever machine he works on his production will be exactly $\alpha$ (in
the population mean) larger than the mean production of all workers
on that machine. Or in other terms, if one workman is 10 units better
than another on one machine, he will be 10 units better than the other
on all machines. Similarly if one machine is 10 units better than
another, that margin is assumed to be the same (in the population
means) regardless of whether a workman is good or bad.

In the general two-factor experiment, the two null hypotheses of
interest are $\alpha_i = 0$ and $\beta_j = 0$. (In the illustration we are using, there
is, of course, little interest in the $\alpha$'s.) We shall therefore try to
partition the total sum of squares into parts, one of which involves the
$\alpha_i$, another the $\beta_j$, and another $\xi$. The proper procedure is suggested
by the estimators of these parameters, which are readily found to be

$$\hat{\xi} = \bar{x} = \frac{1}{rc} \sum_{i,j} x_{ij} \tag{3}$$

$$\hat{\alpha}_i = \bar{x}_{i\cdot} - \bar{x} = \frac{1}{c} \sum_j x_{ij} - \bar{x} \tag{4}$$

$$\hat{\beta}_j = \bar{x}_{\cdot j} - \bar{x} = \frac{1}{r} \sum_i x_{ij} - \bar{x} \tag{5}$$

$\bar{x}_{i\cdot}$ being the mean of the observations in the ith row and $\bar{x}_{\cdot j}$ the mean
of those in the jth column. The total sum of squares may be partitioned
as follows:

$$\sum_{i,j} (x_{ij} - \xi_{ij})^2 = \sum_{i,j} \left[ (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}) + (\bar{x}_{i\cdot} - \bar{x} - \alpha_i) + (\bar{x}_{\cdot j} - \bar{x} - \beta_j) + (\bar{x} - \xi) \right]^2 \tag{6}$$

$$= \sum_{i,j} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2 + c \sum_i (\bar{x}_{i\cdot} - \bar{x} - \alpha_i)^2 + r \sum_j (\bar{x}_{\cdot j} - \bar{x} - \beta_j)^2 + rc(\bar{x} - \xi)^2 \tag{7}$$

Equation (7) is obtained by squaring the expression in (6), using the
grouping indicated by the parentheses; then it is easily seen that the
cross-product terms sum to zero.


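ANALYSIS OF VARIANCE FOR TWO-FACTOR EXPERIMENTS WITH ONE OBSERVATION PER CELL

| Source     | Sum of squares                                                                   | Degrees of freedom | Mean square              | F ratio     |
|------------|----------------------------------------------------------------------------------|--------------------|--------------------------|-------------|
| Mean       | $rc\bar{x}^2 = S_1$                                                              | 1                  | $s_1 = S_1$              | $s_1/s_4$   |
| A effect   | $c \sum_i (\bar{x}_{i\cdot} - \bar{x})^2 = S_2$                                  | r − 1              | $s_2 = S_2/(r-1)$        | $s_2/s_4$   |
| B effect   | $r \sum_j (\bar{x}_{\cdot j} - \bar{x})^2 = S_3$                                 | c − 1              | $s_3 = S_3/(c-1)$        | $s_3/s_4$   |
| Deviations | $\sum_{i,j} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2 = S_4$   | (r − 1)(c − 1)     | $s_4 = S_4/[(r-1)(c-1)]$ |             |
| Total      | $\sum_{i,j} x_{ij}^2$                                                            | rc                 |                          |             |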




If we had some relatively advanced techniques at our disposal, the
analysis would now be virtually complete, for then it would be possible
to argue that the four terms on the right of (7) are each independently
distributed by chi-square laws (on division by $\sigma^2$): the first with
(r − 1)(c − 1) degrees of freedom, the second with r − 1, the third
with c − 1, and the fourth with one degree of freedom. Assuming the
truth of this statement for the moment, we may construct the
analysis-of-variance table shown above. The F ratios in the final column give orthogonal
tests of three null hypotheses: $\xi = 0$, $\alpha_i = 0$, $\beta_j = 0$.
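
The arithmetic behind the analysis-of-variance table can be sketched in a few lines. The Python below is an illustration only: the function name `two_way_anova` and the 3 × 3 data table are hypothetical (they are not the men-and-machines data), and the critical values of the F distribution are not computed, only the ratios themselves.

```python
from fractions import Fraction


def two_way_anova(table):
    """Sums of squares, degrees of freedom, and F ratios for an r x c table
    with one observation per cell, following the partition in equation (7)
    with the parameters set to zero."""
    r, c = len(table), len(table[0])
    grand = Fraction(sum(map(sum, table)), r * c)
    row_means = [Fraction(sum(row), c) for row in table]
    col_means = [Fraction(sum(row[j] for row in table), r) for j in range(c)]

    ss = {
        "mean": r * c * grand ** 2,
        "A": c * sum((m - grand) ** 2 for m in row_means),
        "B": r * sum((m - grand) ** 2 for m in col_means),
        "deviations": sum(
            (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
            for i in range(r) for j in range(c)
        ),
    }
    df = {"mean": 1, "A": r - 1, "B": c - 1, "deviations": (r - 1) * (c - 1)}
    ms = {name: ss[name] / df[name] for name in ss}
    # Each effect is compared with the deviation mean square.
    f_ratio = {name: ms[name] / ms["deviations"] for name in ("mean", "A", "B")}
    return ss, df, f_ratio


# Hypothetical data: rows are levels of A, columns are levels of B.
table = [[6, 8, 10],
         [4, 7, 10],
         [2, 6, 10]]
ss, df, f_ratio = two_way_anova(table)
total = sum(x * x for row in table for x in row)
print(ss, df, f_ratio)
print(sum(ss.values()) == total)   # the four parts add up to the total sum of squares
```

Each F ratio would then be compared with the critical F value for its own degrees of freedom and (r − 1)(c − 1).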

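To demonstrate the validity of the above analysis, we must investigate
the tests more formally. Again the general theory of testing
linear hypotheses will be employed. Equation (1) may be put in the
form of a linear regression function by defining observable parameters
$\delta_i$ and $\epsilon_j$ so that

$$\delta_{i'} = \begin{cases} 1 & \text{if } x_{ij} \text{ has } i = i' \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

$$\epsilon_{j'} = \begin{cases} 1 & \text{if } x_{ij} \text{ has } j = j' \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

Then (1) becomes

$$\xi_{ij} = \xi + \sum_i \delta_i \alpha_i + \sum_j \epsilon_j \beta_j \tag{10}$$

This relation involves only r + c − 1 parameters in view of conditions
(2), so we shall eliminate $\alpha_r$ and $\beta_c$ from (10) to get

$$\xi_{ij} = \xi + \sum_{i=1}^{r-1} (\delta_i - \delta_r)\alpha_i + \sum_{j=1}^{c-1} (\epsilon_j - \epsilon_c)\beta_j \tag{11}$$

and as in Sec. 4, new observable parameters are defined by

$$z_p = \begin{cases} 1 & \text{if } x_{ij} \text{ has } i = p \\ -1 & \text{if } x_{ij} \text{ has } i = r \\ 0 & \text{otherwise} \end{cases} \qquad p = 1, 2, \ldots, r-1 \tag{12}$$

$$z_p = \begin{cases} 1 & \text{if } x_{ij} \text{ has } j = p - r + 1 \\ -1 & \text{if } x_{ij} \text{ has } j = c \\ 0 & \text{otherwise} \end{cases} \qquad p = r, r+1, \ldots, r+c-2 \tag{13}$$

$$z_{r+c-1} = 1 \tag{14}$$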


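There are thus r + c − 1 observable parameters; the first r − 1 are
associated with the $\alpha$'s, the next c − 1 with the $\beta$'s, and the last one
with $\xi$. The population mean is now of the form

$$\xi_{ij} = \sum_{p=1}^{r+c-1} \alpha_p z_p \tag{15}$$

if we redefine

$$\beta_j = \alpha_{r+j-1}, \qquad j = 1, 2, \ldots, c-1 \tag{16}$$

$$\xi = \alpha_{r+c-1} \tag{17}$$

The $\hat{\alpha}_p$ are given by equations (3), (4), and (5), and $\hat{\sigma}^2$ is readily seen
to be

$$\hat{\sigma}^2 = \frac{1}{rc} \sum_{i,j} \Big( x_{ij} - \sum_p \hat{\alpha}_p z_{pij} \Big)^2
 = \frac{1}{rc} \sum_{i,j} (x_{ij} - \hat{\alpha}_i - \hat{\beta}_j - \hat{\xi})^2
 = \frac{1}{rc} \sum_{i,j} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2 \tag{18}$$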


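The joint distribution of the $\hat{\alpha}_p$ is normal, with the matrix of the
quadratic form defined by

$$a_{pq} = \sum_{i,j} z_{pij}\, z_{qij} \tag{19}$$

and on evaluating these sums using (12), (13), and (14), it is easily
found that

$$\|a_{pq}\| = \left( \begin{array}{cccc|cccc|c}
2c & c & \cdots & c & 0 & 0 & \cdots & 0 & 0 \\
c & 2c & \cdots & c & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
c & c & \cdots & 2c & 0 & 0 & \cdots & 0 & 0 \\ \hline
0 & 0 & \cdots & 0 & 2r & r & \cdots & r & 0 \\
0 & 0 & \cdots & 0 & r & 2r & \cdots & r & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & r & r & \cdots & 2r & 0 \\ \hline
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & rc
\end{array} \right) \tag{20}$$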


There are r − 1 rows and columns in the upper left-hand block, and
c − 1 rows and columns in the block completely enclosed by dashed
lines. The form of $\|a_{pq}\|$ shows at once that the three sets of
parameters $(\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{r-1})$, $(\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_{c-1})$, and $(\hat{\xi})$ are
independently distributed; hence their quadratic forms

$$\sum_{i,i'=1}^{r-1} a_{ii'} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_{i'} - \alpha_{i'}) \tag{21}$$

$$\sum_{j,j'=1}^{c-1} a_{r+j-1,\,r+j'-1} (\hat{\beta}_j - \beta_j)(\hat{\beta}_{j'} - \beta_{j'}) \tag{22}$$

$$rc(\hat{\xi} - \xi)^2 \tag{23}$$

are independently distributed by chi-square laws with r − 1, c − 1,
and one degrees of freedom, respectively. All three of them are
distributed independently of $rc\hat{\sigma}^2$ [which has
rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1) degrees of freedom]
in view of the results of Sec. 13.5. These three forms reduce directly
to the last three terms of (7); hence the F ratios of the analysis-of-variance
table are all of the form (13.6.6).
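
The block structure of the matrix in (20) can be confirmed by brute force for any particular r and c. The sketch below is an illustration, not the author's procedure: the helper names are invented, and the choice r = 3, c = 4 is arbitrary. It simply tabulates the z values of (12), (13), and (14) for every cell and accumulates the sums in (19).

```python
def z_vector(i, j, r, c):
    """Values z_1, ..., z_{r+c-1} of (12)-(14) for the cell in row i, column j (1-based)."""
    z = []
    for p in range(1, r):                       # equation (12)
        z.append(1 if i == p else -1 if i == r else 0)
    for q in range(1, c):                       # equation (13)
        z.append(1 if j == q else -1 if j == c else 0)
    z.append(1)                                 # equation (14)
    return z


def a_matrix(r, c):
    """a_pq of equation (19): sum of z_p * z_q over all rc cells."""
    size = r + c - 1
    a = [[0] * size for _ in range(size)]
    for i in range(1, r + 1):
        for j in range(1, c + 1):
            z = z_vector(i, j, r, c)
            for p in range(size):
                for q in range(size):
                    a[p][q] += z[p] * z[q]
    return a


A = a_matrix(3, 4)
for row in A:
    print(row)
```

With r = 3 and c = 4 the upper left block has 2c = 8 on the diagonal and c = 4 elsewhere, the middle block has 2r = 6 and r = 3, and the corner entry is rc = 12; every other entry vanishes.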

14.6. Two-factor Experiments with Several Observations per Cell. 
To continue the illustration that has already been used, suppose again 
that there are four kinds of machines to be tested with five men and 
also that instead of one each of the machines, there are three each. 
Every man works one day with all twelve machines, and the data are 
classified again in a 4 X 5 array, but now there are three observations 
in each cell corresponding to the three machines of each type. 

In general, we shall suppose that there are r rows and c columns and
that there are m observations in each cell. There will then be rcm
observations altogether which will be denoted by $x_{ijk}$ (i = 1, 2, ..., r;
j = 1, 2, ..., c; k = 1, 2, ..., m). The observations in the
(i, j) cell are assumed to be a random sample from a normal population
with mean $\xi_{ij}$ and variance $\sigma^2$, the same for all cells; the cell populations
differ only in their means. The numbers $\xi_{ij}$ may be put in the form

$$\xi_{ij} = \xi + \alpha_i + \beta_j + \gamma_{ij} \tag{1}$$

with

$$\sum_i \alpha_i = 0 \qquad \sum_j \beta_j = 0 \qquad \sum_i \gamma_{ij} = 0 \qquad \sum_j \gamma_{ij} = 0 \tag{2}$$

To do this, one first computes

$$\xi = \frac{1}{rc} \sum_{i,j} \xi_{ij}$$

then

$$\alpha_i = \frac{1}{c} \sum_j \xi_{ij} - \xi \qquad \beta_j = \frac{1}{r} \sum_i \xi_{ij} - \xi$$

and finally, the $\gamma_{ij}$, using (1). $\xi$ is called the mean effect; the $\alpha_i$ are
called the main effects due to rows, or briefly the row effects; the $\beta_j$ are
called column effects; and the $\gamma_{ij}$ are called the row-column interaction
effects, or simply the interactions. When the interactions are all zero,
the means $\xi_{ij}$ are said to be additive (see preceding section).
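
The decomposition of a table of cell means into a mean effect, row effects, column effects, and interactions can be illustrated as follows. This is a sketch under the side conditions (2); the function name `decompose` and the 2 × 2 table of means are hypothetical.

```python
from fractions import Fraction


def decompose(cell_means):
    """Split cell means xi_ij into xi, alpha_i, beta_j, gamma_ij so that
    alpha, beta, and each row and column of gamma sum to zero."""
    r, c = len(cell_means), len(cell_means[0])
    xi = Fraction(sum(map(sum, cell_means)), r * c)
    alpha = [Fraction(sum(row), c) - xi for row in cell_means]
    beta = [Fraction(sum(row[j] for row in cell_means), r) - xi for j in range(c)]
    gamma = [[cell_means[i][j] - xi - alpha[i] - beta[j] for j in range(c)]
             for i in range(r)]
    return xi, alpha, beta, gamma


# Hypothetical population means for a 2 x 2 layout.
xi, alpha, beta, gamma = decompose([[10, 12],
                                    [14, 20]])
print(xi, alpha, beta, gamma)
```

If every entry of `gamma` came out zero, the means would be additive in the sense of the preceding section.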

We shall now partition the sum of squares into parts suitable for
constructing tests on the mean effect, row effects, column effects, and
interactions. Considering first the observations in a single cell, the
sum of squares may be divided into two parts just as was done in
equation (3.6):

$$\sum_k (x_{ijk} - \xi_{ij})^2 = \sum_k (x_{ijk} - \bar{x}_{ij\cdot})^2 + m(\bar{x}_{ij\cdot} - \xi_{ij})^2 \tag{3}$$

where $\bar{x}_{ij\cdot}$ is the cell mean and is the estimator of $\xi_{ij}$. The sum on the
right of (3) has m − 1 degrees of freedom, and the other term on the
right is distributed independently of the sum with one degree of
freedom. Summing (3) over all cells


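$$\sum_{i,j,k} (x_{ijk} - \xi_{ij})^2 = \sum_{i,j,k} (x_{ijk} - \bar{x}_{ij\cdot})^2 + m \sum_{i,j} (\bar{x}_{ij\cdot} - \xi_{ij})^2 \tag{4}$$

the total sum of squares is divided into two parts independently
distributed by chi-square laws (on division by $\sigma^2$), the first with rc(m − 1)
degrees of freedom and the second with rc degrees of freedom. The
second sum of squares for the cell means may be partitioned into four
parts just as was done in equations (5.6) and (5.7) for the case of a
single observation in a two-way table. The result is

$$m \sum_{i,j} (\bar{x}_{ij\cdot} - \xi_{ij})^2 = m \sum_{i,j} (\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x} - \gamma_{ij})^2
 + mc \sum_i (\bar{x}_{i\cdot\cdot} - \bar{x} - \alpha_i)^2 + mr \sum_j (\bar{x}_{\cdot j\cdot} - \bar{x} - \beta_j)^2 + mrc(\bar{x} - \xi)^2 \tag{5}$$

which differs from (5.7) only in the appearance of m and $\gamma_{ij}$. The
symbols $\bar{x}_{i\cdot\cdot}$ and $\bar{x}_{\cdot j\cdot}$ are the row and column means

$$\bar{x}_{i\cdot\cdot} = \frac{1}{mc} \sum_{j,k} x_{ijk} = \frac{1}{c} \sum_j \bar{x}_{ij\cdot}$$

$$\bar{x}_{\cdot j\cdot} = \frac{1}{mr} \sum_{i,k} x_{ijk} = \frac{1}{r} \sum_i \bar{x}_{ij\cdot}$$

while $\bar{x}$ represents the mean of all the observations.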

It is now apparent why the population means were assumed to be
additive in Sec. 5; the first term on the right of (5) corresponds to the
deviation sum of squares in (5.7), and if the $\gamma_{ij}$ were not zero, it would
be impossible to carry out the tests described there, because the $\gamma_{ij}$ are
usually unknown parameters. However, an alternative model to be
described in Sec. 9 allows the tests of Sec. 5 to be made in any case.

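Returning to the present problem, the total sum of squares has been
partitioned into parts which may be exhibited as in the accompanying
table. The degree of freedom corresponding to $\xi$ has been omitted,
as it often is in such tables, because there is practically never any
interest in testing the null hypothesis that $\xi = 0$. The three F ratios
in the final column of the table may be used to test the three null
hypotheses: $\alpha_i = 0$, $\beta_j = 0$, $\gamma_{ij} = 0$. These are the appropriate tests
for these three hypotheses under the theoretical model used here.

| Source      | Sum of squares                                                                                             | Degrees of freedom | Mean square              | F         |
|-------------|------------------------------------------------------------------------------------------------------------|--------------------|--------------------------|-----------|
| Row         | $mc \sum_i (\bar{x}_{i\cdot\cdot} - \bar{x})^2 = S_1$                                                      | r − 1              | $s_1 = S_1/(r-1)$        | $s_1/s_4$ |
| Column      | $mr \sum_j (\bar{x}_{\cdot j\cdot} - \bar{x})^2 = S_2$                                                     | c − 1              | $s_2 = S_2/(c-1)$        | $s_2/s_4$ |
| Interaction | $m \sum_{i,j} (\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x})^2 = S_3$      | (r − 1)(c − 1)     | $s_3 = S_3/[(r-1)(c-1)]$ | $s_3/s_4$ |
| Deviations  | $\sum_{i,j,k} (x_{ijk} - \bar{x}_{ij\cdot})^2 = S_4$                                                       | rc(m − 1)          | $s_4 = S_4/[rc(m-1)]$    |           |
| Total       | $\sum_{i,j,k} (x_{ijk} - \bar{x})^2$                                                                       | rcm − 1            |                          |           |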

Actually in practice the row effects and column effects are rarely
tested in this manner. Ordinarily the two sets of main effects are
tested by the ratios $s_1/s_3$ and $s_2/s_3$. These tests do not make sense in
theory with the present model if the $\gamma_{ij}$ are not zero, for then it is

$$m \sum_{i,j} (\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x} - \gamma_{ij})^2$$

which has the chi-square distribution, not the quantity $S_3$ in which the
$\gamma_{ij}$ have been put equal to zero.

The rationale for comparing main effects with interaction rather
than deviations in an F test may be indicated as follows from a purely
practical standpoint: Using the men and machines illustration,
suppose the null hypothesis, $\gamma_{ij} = 0$, is rejected. The implication is
that while one man does better on one machine than another man,
he may not do as much better than the other on a second machine or he
may even do worse. Suppose these interactions between men and
machines are of the order of 3 or 4 units produced per day. It would
be quite surprising, in view of such interactions, if the main effects
were not at least of this order (3 or 4 units per day). In fact, the
vanishing of the $\alpha_i$ or $\beta_j$ in the face of nonvanishing $\gamma_{ij}$ would rightly
be regarded as a pathological case. Suppose the $\beta_j$ (the main effects
due to machines) are, in truth, of the same order of magnitude as the
interactions. Then certainly the differences between machines are
of no practical consequence, for one might purchase what appears
to be the best machine only to have it operated by a man who does not
happen to work so well with that machine, and better production
might have resulted had another machine been purchased. Obviously
machine differences are important only if they are large relative to
the men-machines interactions.

Arguing very crudely now, the sum of squares $S_2$ in the table is a
measure of the “variance” of the $\beta_j$ since

$$\hat{\beta}_j = \bar{x}_{\cdot j\cdot} - \bar{x}$$

and $S_3$ is a measure of the “variance” of the $\gamma_{ij}$ since

$$\hat{\gamma}_{ij} = \bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x}$$

The ratio $s_2/s_3$ measures the relative sizes of these “variances,” and
if the ratio is large (relative to unity), the machine differences are
important in relation to the interactions. These rough considerations
will be made more precise in Sec. 9.
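
The two ways of testing the main effects, against deviations and against interaction, can be compared side by side on a small example. The sketch below is illustrative only: `replicated_two_way` is an invented name, the 2 × 2 layout with m = 2 observations per cell is made-up data, and no critical F values are looked up.

```python
from fractions import Fraction


def replicated_two_way(cells):
    """cells[i][j] is the list of m observations in row i, column j.
    Returns the four sums of squares S1..S4 of the table and both sets of F ratios."""
    r, c, m = len(cells), len(cells[0]), len(cells[0][0])
    cell_mean = [[Fraction(sum(obs), m) for obs in row] for row in cells]
    row_mean = [sum(means) / c for means in cell_mean]
    col_mean = [sum(cell_mean[i][j] for i in range(r)) / r for j in range(c)]
    grand = sum(row_mean) / r

    S1 = m * c * sum((x - grand) ** 2 for x in row_mean)              # rows
    S2 = m * r * sum((x - grand) ** 2 for x in col_mean)              # columns
    S3 = m * sum((cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
                 for i in range(r) for j in range(c))                 # interaction
    S4 = sum((x - cell_mean[i][j]) ** 2
             for i in range(r) for j in range(c) for x in cells[i][j])  # deviations

    s1, s2 = S1 / (r - 1), S2 / (c - 1)
    s3, s4 = S3 / ((r - 1) * (c - 1)), S4 / (r * c * (m - 1))
    against_deviations = {"rows": s1 / s4, "columns": s2 / s4, "interaction": s3 / s4}
    against_interaction = {"rows": s1 / s3, "columns": s2 / s3}
    return (S1, S2, S3, S4), against_deviations, against_interaction


# Hypothetical data: 2 rows, 2 columns, 2 observations per cell.
cells = [[[1, 3], [5, 7]],
         [[2, 6], [10, 14]]]
sums, f_dev, f_int = replicated_two_way(cells)
print(sums, f_dev, f_int)
```

In this made-up example the column ratio is 72/5 against deviations but only 9 against interaction, which is the kind of difference the crude argument above is concerned with.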
14.7. Three-factor Experiments. To augment our illustrative
example, the products of the machines in question may be made in
several different sizes, and for purposes of the experiment three sizes
may have been selected for inclusion. There would then be three
factors: machines at four levels, men at five levels, sizes of product at
three levels. The observations might then be arranged in a
three-dimensional table with 4 × 5 × 3 = 60 cells, and if there were three
machines of each type, there would again be three observations per
cell or 180 observations in all.

In general, let there be three factors A, B, and C with levels $r_1$, $r_2$, $r_3$,
respectively, and let there be m observations per cell. The observations
may be denoted by $x_{hijk}$, where h = 1, 2, ..., $r_1$; i = 1, 2, ..., $r_2$;
j = 1, 2, ..., $r_3$; k = 1, 2, ..., m. The observations
are assumed to come from normal populations with means $\xi_{hij}$ and
variances $\sigma^2$. The means may be written in the form

$$\xi_{hij} = \xi + \alpha_h + \beta_i + \gamma_j + \delta_{hi} + \epsilon_{hj} + \zeta_{ij} + \eta_{hij} \tag{1}$$

where any letter on the right sums to zero on any one of its indexes.
The $\delta_{hi}$, $\epsilon_{hj}$, $\zeta_{ij}$ are called two-factor interactions, or first-order interactions;
the $\eta_{hij}$ are called three-factor interactions, or second-order interactions.
The details of partitioning the sum of squares are so similar to those
of the preceding section that we shall merely present the resulting
analysis-of-variance table here.

| Source          | Sum of squares                                                                                                                                                                                              | Degrees of freedom                  |
|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------|
| A effect        | $m r_2 r_3 \sum_h (\bar{x}_{h\cdot\cdot\cdot} - \bar{x})^2$                                                                                                                                                 | $r_1 - 1$                           |
| B effect        | $m r_1 r_3 \sum_i (\bar{x}_{\cdot i\cdot\cdot} - \bar{x})^2$                                                                                                                                                | $r_2 - 1$                           |
| C effect        | $m r_1 r_2 \sum_j (\bar{x}_{\cdot\cdot j\cdot} - \bar{x})^2$                                                                                                                                                | $r_3 - 1$                           |
| AB interaction  | $m r_3 \sum_{h,i} (\bar{x}_{hi\cdot\cdot} - \bar{x}_{h\cdot\cdot\cdot} - \bar{x}_{\cdot i\cdot\cdot} + \bar{x})^2$                                                                                          | $(r_1 - 1)(r_2 - 1)$                |
| AC interaction  | $m r_2 \sum_{h,j} (\bar{x}_{h\cdot j\cdot} - \bar{x}_{h\cdot\cdot\cdot} - \bar{x}_{\cdot\cdot j\cdot} + \bar{x})^2$                                                                                         | $(r_1 - 1)(r_3 - 1)$                |
| BC interaction  | $m r_1 \sum_{i,j} (\bar{x}_{\cdot ij\cdot} - \bar{x}_{\cdot i\cdot\cdot} - \bar{x}_{\cdot\cdot j\cdot} + \bar{x})^2$                                                                                        | $(r_2 - 1)(r_3 - 1)$                |
| ABC interaction | $m \sum_{h,i,j} (\bar{x}_{hij\cdot} - \bar{x}_{hi\cdot\cdot} - \bar{x}_{h\cdot j\cdot} - \bar{x}_{\cdot ij\cdot} + \bar{x}_{h\cdot\cdot\cdot} + \bar{x}_{\cdot i\cdot\cdot} + \bar{x}_{\cdot\cdot j\cdot} - \bar{x})^2$ | $(r_1 - 1)(r_2 - 1)(r_3 - 1)$       |
| Deviations      | $\sum_{h,i,j,k} (x_{hijk} - \bar{x}_{hij\cdot})^2$                                                                                                                                                          | $r_1 r_2 r_3 (m - 1)$               |
| Total           | $\sum_{h,i,j,k} (x_{hijk} - \bar{x})^2$                                                                                                                                                                     | $r_1 r_2 r_3 m - 1$                 |

The mean squares are obtained by
dividing the sums of squares by their corresponding degrees of freedom.
The various null hypotheses ($\alpha_h = 0$, $\delta_{hi} = 0$, etc.) are tested by
dividing the appropriate mean square by the deviation mean square
and comparing the result with the critical F value. Here again, most
of these tests would be pointless in many practical situations if some
of the interactions were nonvanishing.

If there is only one observation per cell, there will be no deviation
sum of squares, and it is necessary to use the three-factor-interaction
sum of squares in its place. With the present model this substitution
requires the assumption that the $\eta_{hij}$ are zero.
14.8. Latin and Greco-Latin Squares. Latin and Greco-Latin
squares are devices for reducing the scope of experiments which involve
several factors and for performing experiments when it is impossible
to obtain observations for all combinations of all levels of the factors.
As an illustration of the latter case, we may alter the example already
used. Suppose four kinds of machines (one of each kind) must be
tested in one day and that a man must work at least 2 hours on a
machine in order to get an adequate measure of his production on that
machine. The 8-hour working day will be divided into four 2-hour
periods, but now a third factor has entered the experiment because the
time periods differ, at least to the extent that the workmen may be
expected to be less efficient toward the end of the day due to fatigue.


We have then three factors: machine, men, and time periods. But 
it is impossible to obtain observations for all combinations of all levels 
since, for example, all men cannot work on the first machine during 
the first time period. The difficulty is met by setting up the experi- 
ment so that all factors appear at the same number of levels. Thus, 
since there are four machines and four time periods, we should use four 
men in the experiment. 

The experiment is performed by arranging the levels of one factor 
in a Latin square which is simply a square array of letters such that 
every letter appears once and only once in every row and column. 


We may identify the four letters with the four machines. The rows 
and columns are assigned to the other two factors. Thus if rows 
refer to men and columns to time periods, then the second man works 
on the first machine (A) during the third time period. Of course 
this design could be used for any three-factor experiment where all 
factors are at four levels each. Such an experiment would require 
64 observations for all combinations, whereas with the Latin square 
it can be done with 16 observations; but of course this reduction in 
size of the experiment is at the expense of precision in the results. 
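
A Latin square of any size can be produced by cyclic shifting, and the defining property is easy to check mechanically. The sketch below is an illustration with invented function names; the cyclic square it builds is just one convenient Latin square and is not claimed to be the particular square used in the text.

```python
def cyclic_latin_square(r):
    """An r x r Latin square on the symbols 0, ..., r-1 built by cyclic shifts."""
    return [[(i + j) % r for j in range(r)] for i in range(r)]


def is_latin_square(square):
    """True when every symbol appears once and only once in every row and column."""
    r = len(square)
    symbols = set(range(r))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(r)} == symbols for j in range(r))
    return rows_ok and cols_ok


square = cyclic_latin_square(4)
for row in square:
    print(" ".join("ABCD"[k] for k in row))   # letters stand for the four machines

full_factorial = 4 ** 3          # observations for all combinations of three factors
latin_design = 4 ** 2            # observations needed with the Latin square
print(is_latin_square(square), full_factorial, latin_design)
```

In practice the rows, columns, and letters of such a square would be assigned to the levels of the factors by a random process.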

In general, let us suppose that the three factors of a Latin square
have r levels each and that the observations are $x_{ij(k)}$ where i, j, k = 1,
2, ..., r, and where i refers to rows, j to columns, and k to letters in
the square. The (k) is enclosed in parentheses to indicate that it is
not independent of i and j. The observations are assumed to come
from normal populations with the same variance $\sigma^2$ and with means

$$\xi_{ij(k)} = \xi + \alpha_i + \beta_j + \gamma_k \tag{1}$$

in which $\sum \alpha_i = 0$, $\sum \beta_j = 0$, $\sum \gamma_k = 0$. All interactions are assumed
to be zero in this model.

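If we denote the row means by $\bar{x}_{i\cdot}$, the column means by $\bar{x}_{\cdot j}$, and the
means of observations associated with the kth letter in the square
(the kth level of the third factor) by $\bar{x}_{(k)}$, the sum of squares may easily
be partitioned as follows:

$$\sum_{i,j} (x_{ij(k)} - \xi_{ij(k)})^2 = r \sum_i (\bar{x}_{i\cdot} - \bar{x} - \alpha_i)^2 + r \sum_j (\bar{x}_{\cdot j} - \bar{x} - \beta_j)^2 + r \sum_k (\bar{x}_{(k)} - \bar{x} - \gamma_k)^2$$
$$\qquad + \sum_{i,j} (x_{ij(k)} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} - \bar{x}_{(k)} + 2\bar{x})^2 + r^2(\bar{x} - \xi)^2 \tag{2}$$

All these sums on the right are independently distributed by chi-square
laws (on division by $\sigma^2$); the various sums have degrees of
freedom indicated in the accompanying analysis-of-variance table.
The degree of freedom for the mean has been omitted from the table.

| Source     | Sum of squares                                                                              | Degrees of freedom |
|------------|---------------------------------------------------------------------------------------------|--------------------|
| Rows       | $r \sum_i (\bar{x}_{i\cdot} - \bar{x})^2$                                                   | r − 1              |
| Columns    | $r \sum_j (\bar{x}_{\cdot j} - \bar{x})^2$                                                  | r − 1              |
| Letters    | $r \sum_k (\bar{x}_{(k)} - \bar{x})^2$                                                      | r − 1              |
| Deviations | $\sum_{i,j} (x_{ij(k)} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} - \bar{x}_{(k)} + 2\bar{x})^2$ | (r − 1)(r − 2)     |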


The three null hypotheses, $\alpha_i = 0$, $\beta_j = 0$, $\gamma_k = 0$, are tested by
dividing the appropriate mean square by the deviation mean square
and using the F distribution.
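
The partition for a Latin square can be sketched in the same way as for the two-way table. In the illustration below the function name, the cyclic 3 × 3 square, and the nine observations are all hypothetical; `letters[i][j]` gives the letter (0, 1, or 2) occupying row i and column j.

```python
from fractions import Fraction


def latin_square_anova(x, letters):
    """Sums of squares for rows, columns, letters, and deviations of an r x r
    Latin square, with the parameters in the partition set to zero."""
    r = len(x)
    grand = Fraction(sum(map(sum, x)), r * r)
    row_mean = [Fraction(sum(row), r) for row in x]
    col_mean = [Fraction(sum(row[j] for row in x), r) for j in range(r)]
    letter_total = [0] * r
    for i in range(r):
        for j in range(r):
            letter_total[letters[i][j]] += x[i][j]
    letter_mean = [Fraction(t, r) for t in letter_total]

    ss = {
        "rows": r * sum((m - grand) ** 2 for m in row_mean),
        "columns": r * sum((m - grand) ** 2 for m in col_mean),
        "letters": r * sum((m - grand) ** 2 for m in letter_mean),
        "deviations": sum(
            (x[i][j] - row_mean[i] - col_mean[j] - letter_mean[letters[i][j]] + 2 * grand) ** 2
            for i in range(r) for j in range(r)
        ),
    }
    df = {"rows": r - 1, "columns": r - 1, "letters": r - 1,
          "deviations": (r - 1) * (r - 2)}
    total = sum((v - grand) ** 2 for row in x for v in row)
    return ss, df, total


letters = [[0, 1, 2],
           [1, 2, 0],
           [2, 0, 1]]          # an assumed cyclic Latin square
x = [[3, 5, 7],
     [6, 4, 5],
     [9, 9, 6]]                # made-up observations
ss, df, total = latin_square_anova(x, letters)
print(ss, df, total)
```

Each mean square (sum of squares divided by its degrees of freedom) would then be divided by the deviation mean square to form the F ratio.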


If the number of levels of the factors is a prime number or a power
of a prime number, then it is possible to test more than three factors
without increasing the number of observations. A Greco-Latin square
is an arrangement of r Greek and r Latin letters in an r × r square
so that each Greek and each Latin letter appears once and only once
in every row and column and such that every Greek letter appears
once and only once with each Latin letter. With such an arrangement,
four factors may be tested at r levels each, using only $r^2$ observations,
while the complete experiment would require $r^4$ observations. The
analysis-of-variance table would be similar to the one above for Latin
squares; there would be an extra line for Greek letters having sum of
squares $r \sum_h (\bar{x}_{(h)} - \bar{x})^2$ with r − 1 degrees of freedom, and the error
sum of squares would become

$$\sum_{i,j} (x_{ij(kh)} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} - \bar{x}_{(k)} - \bar{x}_{(h)} + 3\bar{x})^2$$

with (r − 1)(r − 3) degrees of freedom. The $\bar{x}_{(h)}$ represents the mean
of those observations associated with the hth Greek letter.

More generally, when r is a prime or a power of a prime, it is possible
to arrange r − 1 sets of r letters in an r × r square so that each letter
of every set occurs once in every row and column and once with each
letter of every other set. By means of such arrangements, many factors
may be studied in one experiment with relatively few observations.
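
For a prime number of levels, one well-known way to obtain such sets of letters is modular arithmetic: the square with entry (i + a·j) mod r, for a = 1, ..., r − 1. The sketch below uses that construction as an assumption of this illustration; it is not taken from the text, the function names are invented, and the prime-power case (which needs finite-field arithmetic rather than integers mod r) is not covered.

```python
def modular_square(r, a):
    """Square with entry (i + a*j) mod r in row i, column j."""
    return [[(i + a * j) % r for j in range(r)] for i in range(r)]


def is_latin(square):
    """Every symbol once and only once in every row and column."""
    r = len(square)
    symbols = set(range(r))
    return (all(set(row) == symbols for row in square)
            and all({square[i][j] for i in range(r)} == symbols for j in range(r)))


def are_orthogonal(first, second):
    """Every symbol of one square occurs once and only once with each symbol of the other."""
    r = len(first)
    pairs = {(first[i][j], second[i][j]) for i in range(r) for j in range(r)}
    return len(pairs) == r * r


r = 5
squares = [modular_square(r, a) for a in range(1, r)]     # r - 1 sets of r letters
all_latin = all(is_latin(s) for s in squares)
all_orthogonal = all(are_orthogonal(squares[a], squares[b])
                     for a in range(len(squares)) for b in range(a + 1, len(squares)))
print(len(squares), all_latin, all_orthogonal)
```

Any two of these squares superimposed give a Greco-Latin square. The same formula breaks down for a composite r such as 4, where the square with a = 2 repeats symbols within a row.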

14.9. Components-of-variance Models. In this section we shall
consider an alternative mathematical model for analyzing factorial
experiments. To introduce the ideas, we shall consider a two-factor
experiment with one observation per cell, the same situation as was
discussed in Sec. 5. The observations are again denoted by $x_{ij}$ with
i = 1, 2, ..., $r_1$ and j = 1, 2, ..., $r_2$. In this model the row
effects, the column effects, and the interaction effects are all assumed
to be random variables. Specifically it is assumed that

$$x_{ij} = u_i + v_j + w_{ij} \tag{1}$$

where $u_1, u_2, \ldots, u_{r_1}$ is a random sample from a normal population
with mean $\xi_u$; the $v_j$ are an independent random sample from a normal
population with mean $\xi_v$; and the $w_{ij}$ are an independent random sample
from a third normal population with mean $\xi_w$.

Altering the circumstances of the experiment in Sec. 5 slightly,
let us suppose that there are a large number of manufacturers of the
machines in question and that four particular makes were chosen at
random. Also the five men chosen to participate in the experiment
were chosen at random from some large group of men. It is assumed
then that these five men have production abilities $u_1, u_2, \ldots, u_5$
which constitute five observations from a normal population. Similarly
the four machines have productive capacities $v_1, v_2, v_3, v_4$ which
constitute an independently drawn sample from a second population.
The variables $w_{ij}$ may be looked upon as a sum of two variables, say
$y_{ij} + z_{ij}$, with the $y_{ij}$ interpreted as the interactions between men and
machines and the $z_{ij}$ consisting of miscellaneous minor effects which
influence the final observations. These two variables y and z are
assumed to be normal random variates, and their sum w will then be a
normal random variate.
Referring back to equation (1), if we let

$$\xi = \xi_u + \xi_v + \xi_w \qquad a_i = u_i - \xi_u \qquad b_j = v_j - \xi_v \qquad c_{ij} = w_{ij} - \xi_w$$

then the equation may be written in the form

$$x_{ij} = \xi + a_i + b_j + c_{ij} \tag{2}$$

where the three variates a, b, c now have zero means. We shall denote
their variances by $\sigma_a^2$, $\sigma_b^2$, $\sigma_c^2$, respectively. Clearly the mean of any
$x_{ij}$ is $\xi$, and the variance of any $x_{ij}$ is

$$\sigma^2 = \sigma_a^2 + \sigma_b^2 + \sigma_c^2 \tag{3}$$

since the three variates are assumed to be independent. It is to be
observed that the x's themselves are not independent if they fall in the
same row or column. Thus, for example,

$$E[(x_{11} - \xi)(x_{12} - \xi)] = E[(a_1 + b_1 + c_{11})(a_1 + b_2 + c_{12})] \tag{4}$$

$$= \sigma_a^2 \tag{5}$$

which arises from the $a_1^2$ term on the right of (4). Similarly the covariance
between two observations in the same column is $\sigma_b^2$.

With the present model, the null hypothesis that the row effects are
identical takes the form $\sigma_a^2 = 0$. This is to say that the $a_i$, which
have mean zero, are actually identically zero; their distribution is
concentrated at a point (zero), which is the only way $\sigma_a^2$ can be zero.
Similarly the null hypothesis that the column effects are all the same
takes the form $\sigma_b^2 = 0$.

To test these two hypotheses, the sum of squares is partitioned just
as before:

$$\sum_{i,j} (x_{ij} - \bar{x})^2 = \sum_{i,j} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2 + r_2 \sum_i (\bar{x}_{i\cdot} - \bar{x})^2 + r_1 \sum_j (\bar{x}_{\cdot j} - \bar{x})^2 \tag{6}$$

in which the degree of freedom for $\bar{x}$ has been omitted. If we substitute
for the x's on the right in terms of a, b, and c, the result is

$$\sum_{i,j} (x_{ij} - \bar{x})^2 = \sum_{i,j} (c_{ij} - \bar{c}_{i\cdot} - \bar{c}_{\cdot j} + \bar{c})^2 + r_2 \sum_i (a_i + \bar{c}_{i\cdot} - \bar{a} - \bar{c})^2 + r_1 \sum_j (b_j + \bar{c}_{\cdot j} - \bar{b} - \bar{c})^2 \tag{7}$$

where the $\bar{c}$'s are defined in the same way as the $\bar{x}$'s and $\bar{a} = \sum a_i / r_1$,
$\bar{b} = \sum b_j / r_2$. It is easily shown that the three sums on the right are
independently distributed by chi-square laws by using the results of
Sec. 5. The results of the latter part of that section may be applied
to the $c_{ij}$ since these variables are independently normally distributed.
(Since the $c_{ij}$ all have zero means, the means $\alpha_i$, $\beta_j$, $\xi$ of Sec. 5 are all
replaced by zero.) It follows from Sec. 13.5 and equations (5.20),
(5.21), (5.22) that $\sum (c_{ij} - \bar{c}_{i\cdot} - \bar{c}_{\cdot j} + \bar{c})^2$ is distributed independently
of the deviations $\bar{c}_{i\cdot} - \bar{c}$ and $\bar{c}_{\cdot j} - \bar{c}$; further, the set of deviations
$\bar{c}_{i\cdot} - \bar{c}$ is distributed independently of the set $\bar{c}_{\cdot j} - \bar{c}$, as follows from
(5.20). Also the sum in question when divided by $\sigma_c^2$ has the chi-square
distribution with $(r_1 - 1)(r_2 - 1)$ degrees of freedom.
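
The substitution that carries (6) into (7) is purely algebraic, so it can be checked term by term on any fixed set of numbers. In the sketch below the values of $\xi$, a, b, and c are arbitrary made-up constants (not random draws), and the helper names are invented for the illustration.

```python
from fractions import Fraction


def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))


def row_means(t):
    return [mean(row) for row in t]


def col_means(t):
    return [mean(row[j] for row in t) for j in range(len(t[0]))]


def grand_mean(t):
    return mean(v for row in t for v in row)


xi = 10
a = [1, -2, 4]                        # r1 = 3 row effects
b = [0, 3]                            # r2 = 2 column effects
c = [[1, -1], [2, 0], [-3, 5]]        # interaction-plus-error terms
r1, r2 = len(a), len(b)
x = [[xi + a[i] + b[j] + c[i][j] for j in range(r2)] for i in range(r1)]   # equation (2)

xr, xc, xg = row_means(x), col_means(x), grand_mean(x)
cr, cc, cg = row_means(c), col_means(c), grand_mean(c)
a_bar, b_bar = mean(a), mean(b)

# Left sides: the three sums in (6), computed from the observations.
dev_x = sum((x[i][j] - xr[i] - xc[j] + xg) ** 2 for i in range(r1) for j in range(r2))
rows_x = r2 * sum((m - xg) ** 2 for m in xr)
cols_x = r1 * sum((m - xg) ** 2 for m in xc)

# Right sides: the three sums in (7), computed from a, b, and c.
dev_c = sum((c[i][j] - cr[i] - cc[j] + cg) ** 2 for i in range(r1) for j in range(r2))
rows_ac = r2 * sum((a[i] + cr[i] - a_bar - cg) ** 2 for i in range(r1))
cols_bc = r1 * sum((b[j] + cc[j] - b_bar - cg) ** 2 for j in range(r2))

print(dev_x == dev_c, rows_x == rows_ac, cols_x == cols_bc)
```

The deviation sum of squares involves the c's alone, which is why its distribution does not depend on $\sigma_a^2$ or $\sigma_b^2$.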

Since, by assumption, the c's are independent of the a's and b's, it
follows that the first sum on the right of (7) is distributed independently
of the other two sums. These other two sums are also distributed
independently, since the variables $a_i$ and the variables $\bar{c}_{i\cdot} - \bar{c}$
are independent of the $b_j$ (by hypothesis) and the $\bar{c}_{\cdot j} - \bar{c}$ (by equation
20 of Sec. 5). Furthermore, these two sums are distributed by chi-square
laws. For considering the sum $\sum (a_i + \bar{c}_{i\cdot} - \bar{a} - \bar{c})^2$, we may
let

$$y_i = a_i + \bar{c}_{i\cdot}$$

and we know that y is a normally distributed variate with mean zero
and variance $\sigma_a^2 + (\sigma_c^2 / r_2)$. Thus

$$\frac{\sum (y_i - \bar{y})^2}{\sigma_a^2 + (\sigma_c^2 / r_2)}$$

has the chi-square distribution with $r_1 - 1$ degrees of freedom; hence
it follows that the second sum on the right of (7), when divided by
$(r_2 \sigma_a^2 + \sigma_c^2)$, has the chi-square distribution with $r_1 - 1$ degrees of
freedom. In the same vein, the third sum on the right of (7), when
divided by $(r_1 \sigma_b^2 + \sigma_c^2)$, has the chi-square distribution with $r_2 - 1$
degrees of freedom. All these results may be summarized in the
accompanying table. The final column provides the divisors which
make the corresponding sums of squares chi-square variates.

| Source     | Sum of squares                                                         | Degrees of freedom     | Expected mean square           |
|------------|------------------------------------------------------------------------|------------------------|--------------------------------|
| Rows       | $r_2 \sum_i (\bar{x}_{i\cdot} - \bar{x})^2$                            | $r_1 - 1$              | $\sigma_c^2 + r_2 \sigma_a^2$  |
| Columns    | $r_1 \sum_j (\bar{x}_{\cdot j} - \bar{x})^2$                           | $r_2 - 1$              | $\sigma_c^2 + r_1 \sigma_b^2$  |
| Deviations | $\sum_{i,j} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2$ | $(r_1 - 1)(r_2 - 1)$   | $\sigma_c^2$                   |
| Total      | $\sum_{i,j} (x_{ij} - \bar{x})^2$                                      | $r_1 r_2 - 1$          |                                |


To test the null hypothesis $\sigma_a^2 = 0$, one again uses the ratio of the
row mean square to the deviation mean square and compares that ratio
with the critical value of the F distribution for $r_1 - 1$ and
$(r_1 - 1)(r_2 - 1)$ degrees of freedom. For under the null hypothesis
these two sums of squares have the same divisor $\sigma_c^2$; hence that unknown
parameter cancels out in the ratio of the two chi squares, and the ratio of
the mean squares has the F distribution. The tests for the row and column
effects thus take exactly the same form as those of Sec. 5, but here no
assumption of additivity is required.

14.10. Components of Variance for Two-factor and Three-factor 
Experiments. For a two-factor experiment with m observations per 
cell, the observations are assumed to be of the form 


Tir = E + a: + bi + eg + Cie (1) 


where the $a$'s, $b$'s, $c$'s, and $e$'s are normally distributed with zero means.
The $a$'s are associated with row effects, the $b$'s with column effects,
the $c$'s with row-column interactions, and the $e$'s with all other miscellaneous
effects which influence the observations. The variances of
these variates will be denoted by $\sigma_a^2$, $\sigma_b^2$, $\sigma_{ab}^2$, and $\sigma^2$; the $\sigma_{ab}^2$ is used in
preference to $\sigma_c^2$ to indicate more clearly that it refers to the population
of row-column interactions. We shall leave the details as an exercise,
since they are very similar to those of Sec. 9, and merely present the
results. The final column of the accompanying table shows at a
glance the appropriate ratios of mean squares for testing the various
null hypotheses: for $\sigma_{ab}^2 = 0$, one compares the interaction mean
square with the deviation mean square (this is sometimes called the
test of additivity); the main effects are tested against interaction (not
against deviations as was the case in Sec. 6).

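Source | Sum of squares | Degrees of freedom | Expected mean square
Rows | $mr_2\sum_i(\bar{x}_{i..} - \bar{x})^2$ | $r_1 - 1$ | $\sigma^2 + m\sigma_{ab}^2 + mr_2\sigma_a^2$
Columns | $mr_1\sum_j(\bar{x}_{.j.} - \bar{x})^2$ | $r_2 - 1$ | $\sigma^2 + m\sigma_{ab}^2 + mr_1\sigma_b^2$
Interactions | $m\sum_{i,j}(\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x})^2$ | $(r_1 - 1)(r_2 - 1)$ | $\sigma^2 + m\sigma_{ab}^2$
Deviations | $\sum_{i,j,k}(x_{ijk} - \bar{x}_{ij.})^2$ | $r_1r_2(m - 1)$ | $\sigma^2$
Total | $\sum_{i,j,k}(x_{ijk} - \bar{x})^2$ | $r_1r_2m - 1$ |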
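As an illustration of how the ratios indicated by the expected-mean-square column might be computed, the following Python sketch forms the four sums of squares for an $r_1 \times r_2$ layout with $m$ observations per cell. It is a sketch under stated assumptions: NumPy is assumed, and the function name `two_factor_random_effects` and the array layout (rows, columns, replicates) are hypothetical choices made for the example.

```python
import numpy as np

def two_factor_random_effects(x):
    """Two-factor components-of-variance analysis; x has shape (r1, r2, m)."""
    x = np.asarray(x, dtype=float)
    r1, r2, m = x.shape
    grand = x.mean()
    cell = x.mean(axis=2)            # cell means
    row = x.mean(axis=(1, 2))        # row means
    col = x.mean(axis=(0, 2))        # column means

    ss_rows = m * r2 * np.sum((row - grand) ** 2)
    ss_cols = m * r1 * np.sum((col - grand) ** 2)
    ss_int = m * np.sum((cell - row[:, None] - col[None, :] + grand) ** 2)
    ss_dev = np.sum((x - cell[:, :, None]) ** 2)

    ms_rows = ss_rows / (r1 - 1)
    ms_cols = ss_cols / (r2 - 1)
    ms_int = ss_int / ((r1 - 1) * (r2 - 1))
    ms_dev = ss_dev / (r1 * r2 * (m - 1))

    return {
        "ss": (ss_rows, ss_cols, ss_int, ss_dev),
        # Interaction is tested against deviations ...
        "F_interaction": ms_int / ms_dev,
        # ... but main effects are tested against interaction.
        "F_rows": ms_rows / ms_int,
        "F_cols": ms_cols / ms_int,
    }
```

The choice of denominator is the whole point: in this model the row and column mean squares share their divisor with the interaction mean square, not with the deviation mean square, when the corresponding main-effect variance is zero.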


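For the three-factor experiment, the expected mean squares for the
table of Sec. 7 are:

Source | Expected mean square
A effect | $\sigma^2 + m\sigma_{abc}^2 + mr_3\sigma_{ab}^2 + mr_2\sigma_{ac}^2 + mr_2r_3\sigma_a^2$
B effect | $\sigma^2 + m\sigma_{abc}^2 + mr_3\sigma_{ab}^2 + mr_1\sigma_{bc}^2 + mr_1r_3\sigma_b^2$
C effect | $\sigma^2 + m\sigma_{abc}^2 + mr_2\sigma_{ac}^2 + mr_1\sigma_{bc}^2 + mr_1r_2\sigma_c^2$
AB interaction | $\sigma^2 + m\sigma_{abc}^2 + mr_3\sigma_{ab}^2$
AC interaction | $\sigma^2 + m\sigma_{abc}^2 + mr_2\sigma_{ac}^2$
BC interaction | $\sigma^2 + m\sigma_{abc}^2 + mr_1\sigma_{bc}^2$
ABC interaction | $\sigma^2 + m\sigma_{abc}^2$
Deviations | $\sigma^2$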


where $\sigma_a^2$ is the variance of the population of A main effects, $\sigma_{ab}^2$ is the
variance of the population of the two-factor (AB) interaction effects,
$\sigma_{abc}^2$ is that of the three-factor interaction effects, and so forth. The
expected mean squares for experiments with more than three factors
may be readily written down as follows: Every expected mean square
involves the deviation variance with coefficient one and all other
variances which have subscripts containing all the letters corresponding
to the mean square in question. The coefficients of these variances
are the products of the ranges of all indices on the $x$'s except those
associated with subscripts on the variances.
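The rule just stated is mechanical enough to be written as a short program. The Python sketch below is an illustration only: the function name `expected_mean_squares`, the use of single letters for factors, and the key `"deviations"` for the deviation variance are all assumptions made for the example. It returns, for each source, the coefficient attached to each variance component.

```python
from itertools import combinations
from math import prod

def expected_mean_squares(ranges, m):
    """Coefficients of the variance components in each expected mean square.

    ranges: dict mapping a one-letter factor name to its number of levels,
            e.g. {"a": r1, "b": r2, "c": r3}
    m:      number of observations per cell
    """
    letters = sorted(ranges)
    subsets = ["".join(c) for size in range(1, len(letters) + 1)
               for c in combinations(letters, size)]
    table = {}
    for source in subsets:
        terms = {"deviations": 1}          # deviation variance, coefficient one
        for comp in subsets:
            # Keep components whose subscripts contain every letter of the source.
            if set(source) <= set(comp):
                # Product of the ranges of all indices not in the subscript;
                # m is the range of the within-cell index.
                terms[comp] = m * prod(ranges[f] for f in letters if f not in comp)
        table[source] = terms
    table["deviations"] = {"deviations": 1}
    return table
```

For example, `expected_mean_squares({"a": 2, "b": 3, "c": 4}, m=5)["a"]` gives coefficient 60 for the `a` component ($mr_2r_3$), 20 for `ab` ($mr_3$), 15 for `ac` ($mr_2$), and 5 for `abc` ($m$), in agreement with the first row of the three-factor table.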



A very troublesome difficulty is encountered in three-factor and
higher order components-of-variance models. In the present instance
one obviously tests the three-factor interaction against deviations, and
one tests the two-factor interactions against the three-factor interaction,
but what is to be done with the main effects? On putting $\sigma_a^2 = 0$ to
test the main effect of A, there is still no pair of chi squares with common
divisors. If it happens that one of the two-factor interactions is
zero, there is no trouble. Thus, if the hypothesis $\sigma_{ab}^2 = 0$ is not
rejected, then the main effect of A may be tested against the AC
interaction.

If neither of the two-factor interactions is zero, then a theoretically
satisfactory test for the main effect in question can become a troublesome
matter. In practice, the following simple approximation device
is ordinarily employed: Suppose it is desired to test $\sigma_a^2 = 0$. Let
$y_1$, $y_2$, $y_3$ be the sums of squares for AB interaction, AC interaction,
ABC interaction; let $n_1$, $n_2$, $n_3$ be their respective degrees of freedom;
let $k_1$, $k_2$, $k_3$ be their respective expected mean squares. Since $y_i/k_i$
is a chi-square variate, the mean and variance of $y_i$ are $n_ik_i$ and $2n_ik_i^2$.
It is evident from the above table of expected values that the variate

$$v = \frac{y_1}{n_1} + \frac{y_2}{n_2} - \frac{y_3}{n_3} \tag{2}$$

has the right mean value for an F test of $\sigma_a^2 = 0$, but $v$ does not have the
distribution of a mean square. However, if the $n_i$ are large, the shape
of the distribution of $v$ does not differ much from the shape of the
distribution of a mean square, and the approximate test treats $v$ as
if it did have such a distribution. The only question remaining is
how many degrees of freedom shall be associated with $v$. Letting $N$
be this number of degrees of freedom, one determines $N$ so that the
variance of the approximating distribution is the same as the variance
of the actual distribution. The true variance of $v$ is

$$\sigma_v^2 = 2\left(\frac{k_1^2}{n_1} + \frac{k_2^2}{n_2} + \frac{k_3^2}{n_3}\right) \tag{3}$$

while the variance of a mean square with $N$ degrees of freedom and with
expected value $k_1 + k_2 - k_3$ is

$$\frac{2}{N}(k_1 + k_2 - k_3)^2 \tag{4}$$

On equating (3) and (4), $N$ is found to be

$$N = \frac{(k_1 + k_2 - k_3)^2}{(k_1^2/n_1) + (k_2^2/n_2) + (k_3^2/n_3)} \tag{5}$$

In practice, of course, the $k_i$ are unknown, but they can be estimated
by the $y_i$; i.e., $\hat{k}_i = y_i/n_i$. Thus the approximate test for $\sigma_a^2 = 0$ is
to treat

$$\frac{mr_2r_3\sum_i(\bar{x}_{i...} - \bar{x})^2}{(r_1 - 1)\left(\dfrac{y_1}{n_1} + \dfrac{y_2}{n_2} - \dfrac{y_3}{n_3}\right)} \tag{6}$$

as an F variate with $r_1 - 1$ and $N$ degrees of freedom, where $N$ is
determined by (5) with the $k_i$ replaced by $y_i/n_i$.
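The arithmetic of this approximate test is short enough to sketch in Python. The sketch below is illustrative only; the function names `approximate_denominator` and `approximate_F` are hypothetical, and the inputs are assumed to be the three sums of squares and degrees of freedom in the order AB, AC, ABC used above.

```python
def approximate_denominator(y, n):
    """Synthetic denominator v of (2) and its approximate degrees of freedom N of (5).

    y: sums of squares for the AB, AC, and ABC interactions, in that order
    n: their respective degrees of freedom
    """
    k_hat = [yi / ni for yi, ni in zip(y, n)]      # estimates of k1, k2, k3
    v = k_hat[0] + k_hat[1] - k_hat[2]
    N = v ** 2 / sum(k ** 2 / ni for k, ni in zip(k_hat, n))
    return v, N

def approximate_F(ss_a, df_a, y, n):
    """Ratio (6): the A mean square divided by v, with df_a and N degrees of freedom."""
    v, N = approximate_denominator(y, n)
    return (ss_a / df_a) / v, df_a, N
```

Because $v$ is a difference of mean squares, it can turn out zero or negative in a particular sample, in which case the ratio is not usable; and $N$ will generally not be an integer, so a table lookup would need interpolation or rounding.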

14.11. Mixed Models. We are now able to investigate the mathe- 
matical model ordinarily needed to analyze data from factorial experi- 
ments. In most experiments, the levels of some factors are to be 
regarded as fixed constants whereas the levels of other factors must be 
regarded as random variables; hence the required model must be a 
combination of the two models already discussed. As an illustration 
of such a model, we shall return to the experiment described in Sec. 6. 
The effects of the machines will be regarded as fixed constants, while 
the effects of the workmen will be regarded as a sample of observations 
from some population of workmen. 

Using the notation of Sec. 6, the observations are now regarded as
being of the form

$$x_{ijk} = \xi + a_i + \beta_j + c_{ij} + e_{ijk} \tag{1}$$

where now $i = 1, 2, \cdots, r_1$; $j = 1, 2, \cdots, r_2$; $k = 1, 2, \cdots, m$.
The $a_i$ are observations from a normal population with zero mean and
variance $\sigma_a^2$; the $\beta_j$ are constants whose sum is zero (the average
machine effect, for example, is included in $\xi$); the $c$'s and $e$'s are random
observations from normal populations with zero means and variances
$\sigma_{a\beta}^2$ and $\sigma^2$.


The sum of squares is partitioned as before into parts associated with
the various factors:

$$\sum_{i,j,k}(x_{ijk} - \bar{x})^2 = \sum_{i,j,k}(x_{ijk} - \bar{x}_{ij.})^2 + m\sum_{i,j}(\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x})^2 + mr_2\sum_i(\bar{x}_{i..} - \bar{x})^2 + mr_1\sum_j(\bar{x}_{.j.} - \bar{x})^2 \tag{2}$$

On substituting for the $x$'s in the first sum, it becomes $\sum(e_{ijk} - \bar{e}_{ij.})^2$;
hence on division by $\sigma^2$ this sum has the chi-square distribution with
$r_1r_2(m - 1)$ degrees of freedom. The second sum becomes

$$m\sum_{i,j}\left[c_{ij} + \bar{e}_{ij.} - (\bar{c}_{i.} + \bar{e}_{i..}) - (\bar{c}_{.j} + \bar{e}_{.j.}) + \bar{c} + \bar{e}\right]^2$$

and replacing the $c_{ij} + \bar{e}_{ij.}$ by $y_{ij}$, which has variance $\sigma_{a\beta}^2 + (\sigma^2/m)$, it
is evident that this term, when divided by $m\sigma_{a\beta}^2 + \sigma^2$, has a chi-square
distribution with $(r_1 - 1)(r_2 - 1)$ degrees of freedom and is distributed
independently of the first sum, since the deviations $e_{ijk} - \bar{e}_{ij.}$
are independent of the $\bar{e}_{ij.}$. Similarly the third sum is independent of
the first two and has the chi-square distribution with $r_1 - 1$ degrees
of freedom on division by $\sigma^2 + m\sigma_{a\beta}^2 + mr_2\sigma_a^2$.
The final sum on the right of (2) becomes

$$mr_1\sum_j(\bar{c}_{.j} + \bar{e}_{.j.} - \bar{c} - \bar{e} + \beta_j)^2$$

which is independently distributed of the other sums but does not have
the chi-square distribution. However the quantity

$$mr_1\sum_j(\bar{x}_{.j.} - \bar{x} - \beta_j)^2$$

does have the chi-square distribution (on division by $\sigma^2 + m\sigma_{a\beta}^2$); hence
under the null hypothesis, $\beta_j = 0$, the final sum on the right of (2)
does have the chi-square distribution.

The analysis-of-variance table presented here may be compared
with that of Sec. 10. The final column shows at a glance what ratios
of mean squares are appropriate for testing the various null hypotheses.
The main effects in both cases are to be tested against interaction, not
against deviations.

Source | Sum of squares | Degrees of freedom | Expected mean square
A effect | $mr_2\sum_i(\bar{x}_{i..} - \bar{x})^2$ | $r_1 - 1$ | $\sigma^2 + m\sigma_{a\beta}^2 + mr_2\sigma_a^2$
B effect | $mr_1\sum_j(\bar{x}_{.j.} - \bar{x})^2$ | $r_2 - 1$ | $\sigma^2 + m\sigma_{a\beta}^2 + \dfrac{mr_1\sum_j\beta_j^2}{r_2 - 1}$
AB interaction | $m\sum_{i,j}(\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x})^2$ | $(r_1 - 1)(r_2 - 1)$ | $\sigma^2 + m\sigma_{a\beta}^2$
Deviations | $\sum_{i,j,k}(x_{ijk} - \bar{x}_{ij.})^2$ | $r_1r_2(m - 1)$ | $\sigma^2$

14.12. Analysis of Covariance. The analysis of covariance is a 
technique employed in analyzing factorial experiments when the 
subject of the experiment is related via a regression function to certain 
observable parameters. As an example of an experiment in which 
the method would be used, let us suppose that penetration of different 
kinds of steel plates by 50-caliber projectiles is being studied. Sup- 
pose there are k plates, one of each kind, and that m projectiles are 
to be fired at each plate. The depth to which the $j$th projectile penetrates
the $i$th plate will be denoted by $x_{ij}$. Thus far we have a one-factor
experiment with $k$ levels and $m$ observations per cell. But the
velocity of the projectiles will be a critical factor in the depth of pene- 
tration. We shall suppose that this factor is not of interest for pur- 
poses of the present experiment; we merely wish to observe for a fixed 
velocity whether the resistances of the plates differ significantly. 
However it is impossible to fire each bullet with exactly the same 
velocity; and in performing the experiment, the velocity of each one 
will be measured photographically, and then the effects of the varia- 
tions in velocity will be taken account of in the analysis of the data. 
Let the velocities be denoted by $z_{ij}$. The observations $x_{ij}$ are now
assumed to be normally distributed with variance $\sigma^2$ about the linear
regression functions

$$\alpha_i + \beta_iz_{ij} \tag{1}$$


In the experiment just described, the observable parameter $z$ is
associated with an extraneous factor (velocity) which cannot be entirely
controlled and must be dealt with in the analysis of the data. In
other experiments, the observable parameter may be associated with
a factor of interest. Thus in the above experiment we may desire to
study the two factors (type of plate and velocity) and might vary
the velocities over a considerable range. But in this latter experiment 
the simple linear regression function might not be adequate, and we 
shall restrict our illustration to the simpler situation. In more 
elaborate experiments, there may be several observable parameters 
corresponding to each of several factors for which it is impossible or 
inconvenient to assign specific levels. Ordinarily, when it is possible, 
factors are studied in experiments by assigning to them a specific set 
of levels rather than an observable parameter, because the analysis of 
the resulting data is simpler. 

Returning to the illustrative example, we have a two-factor experiment
in a one-way classification. One factor (type of plate) is
assigned specific levels which form the one-way classification, and the
other factor (velocity) is represented by an observable parameter $z$.
The data consist of $mk$ pairs of observations $(x_{ij}, z_{ij})$ with $i = 1, 2, \cdots, k$
and $j = 1, 2, \cdots, m$. We wish to test whether either of the
factors affects the subject (depth of penetration), and in particular
whether the plates differ when effects due to differing velocities are
removed.

The sum of squares of deviations from the regression function for the
observations in a single cell may be partitioned just as was done in
(2.5) to obtain

$$\sum_j(x_{ij} - \alpha_i - \beta_iz_{ij})^2 = \sum_j(x_{ij} - \hat{\alpha}_i - \hat{\beta}_iz_{ij})^2 + (\hat{\beta}_i - \beta_i)^2\sum_j(z_{ij} - \bar{z}_{i.})^2 + m(\bar{x}_{i.} - \alpha_i - \beta_i\bar{z}_{i.})^2 \tag{2}$$

where the sums on the right are independently distributed by chi-square
laws (on division by $\sigma^2$) with $m - 2$, one, and one degrees of
freedom, respectively. If now (2) is summed on $i$, the total sum of
squares will be partitioned into three parts independently distributed
with $k(m - 2)$, $k$, and $k$ degrees of freedom, respectively. The result
is

$$\sum_{i,j}(x_{ij} - \alpha_i - \beta_iz_{ij})^2 = \sum_{i,j}(x_{ij} - \hat{\alpha}_i - \hat{\beta}_iz_{ij})^2 + \sum_{i,j}(\hat{\beta}_i - \beta_i)^2(z_{ij} - \bar{z}_{i.})^2 + m\sum_i(\bar{x}_{i.} - \alpha_i - \beta_i\bar{z}_{i.})^2 \tag{3}$$

We shall first investigate the hypothesis that the slopes of the
regression lines are the same for all cells. To this end we write

$$\beta_i = \beta + \gamma_i \tag{4}$$

and the null hypothesis then may be put in the form $\gamma_i = 0$. To test
this hypothesis, the middle sum on the right of (3) is to be partitioned
into two parts: one with $k - 1$ degrees of freedom involving the $\gamma_i$
and the other with one degree of freedom involving $\beta$.

If we let

$$w_i = \sum_j(z_{ij} - \bar{z}_{i.})^2 \tag{5}$$

then it is apparent from the middle term on the right of (2) that $\hat{\beta}_i$ is
normally distributed with mean $\beta_i$ and variance $\sigma^2/w_i$. Furthermore,
the $\hat{\beta}_i$ are independently distributed. If their variances were equal,
one could partition $\sum(\hat{\beta}_i - \beta_i)^2$ directly into $\sum(\hat{\beta}_i - \gamma_i - \hat{\beta})^2$ and
$k(\hat{\beta} - \beta)^2$ with $k - 1$ and one degrees of freedom, but this is not the
proper procedure here (see Prob. 23 at the end of Chap. 12). The
deviations of the $\hat{\beta}_i$ must be taken not from their simple average but
from their weighted average, say

$$\hat{\beta} = \frac{\sum_iw_i\hat{\beta}_i}{\sum_iw_i} \tag{6}$$

Furthermore, $\beta$ in equation (4) must be similarly defined so that the
$\gamma_i$ represent deviations of the $\beta_i$ from

$$\beta = \frac{\sum_iw_i\beta_i}{\sum_iw_i} \tag{7}$$

Now the middle term on the right of (3) may be partitioned thus:

$$\sum_iw_i(\hat{\beta}_i - \beta_i)^2 = \sum_iw_i\left[(\hat{\beta}_i - \gamma_i - \hat{\beta}) + (\hat{\beta} - \beta)\right]^2 = \sum_iw_i(\hat{\beta}_i - \gamma_i - \hat{\beta})^2 + (\hat{\beta} - \beta)^2\sum_iw_i \tag{8}$$

since the sum of cross-product terms vanishes in view of (6) and (7).
It follows from the result of Prob. 31 of Chap. 10 that the two terms
on the right of (8) are independently distributed by chi-square laws
with $k - 1$ and one degrees of freedom, respectively. Under the null
hypothesis, $\gamma_i = 0$, the first sum on the right of (8) with the first sum
on the right of (3) determines an F variate with $k - 1$ and $k(m - 2)$
degrees of freedom. The other degree of freedom on the right of (8)
provides an orthogonal test of the null hypothesis, $\beta = 0$.
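The test of equal slopes can be sketched numerically as follows. This Python fragment is an illustration rather than part of the original development: NumPy is assumed, the function name `slope_homogeneity` is hypothetical, and the data are assumed to be arranged as $k \times m$ arrays with one row per cell.

```python
import numpy as np

def slope_homogeneity(x, z):
    """F ratio for the hypothesis that the within-cell slopes are all equal.

    x, z: arrays of shape (k, m); row i holds the m observations of cell i.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    k, m = x.shape
    xc = x - x.mean(axis=1, keepdims=True)     # deviations from cell means
    zc = z - z.mean(axis=1, keepdims=True)

    w = np.sum(zc ** 2, axis=1)                # weights w_i, equation (5)
    b = np.sum(zc * xc, axis=1) / w            # within-cell slopes
    b_bar = np.sum(w * b) / np.sum(w)          # weighted average, equation (6)

    # Deviations from the separately fitted within-cell regression lines.
    ss_dev = np.sum((xc - b[:, None] * zc) ** 2)
    # Weighted sum of squares of the slopes about their weighted average.
    ss_slopes = np.sum(w * (b - b_bar) ** 2)

    f_ratio = (ss_slopes / (k - 1)) / (ss_dev / (k * (m - 2)))
    return {"slopes": b, "common_slope": b_bar, "ss_dev": ss_dev,
            "ss_slopes": ss_slopes, "F": f_ratio, "df": (k - 1, k * (m - 2))}
```

The weighting matters: a slope estimated from widely spread $z$ values has a smaller variance and therefore counts for more in the average than one estimated from tightly clustered values.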

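Turning to the third sum on the right of (3), we should like to partition
it so as to get an appropriate test of the hypothesis that the $\alpha_i$
are all equal. Unfortunately this is not possible unless the $\beta_i$ are all
zero. However, it is possible to partition the sum to get some useful
information about the $\alpha_i$, particularly when the $\beta_i$ are equal. One sets
up the null hypothesis,

$$E(\bar{x}_{i.}) = \alpha' + \beta'\bar{z}_{i.} \tag{9}$$

which states that the cell means $(\bar{x}_{i.}, \bar{z}_{i.})$ fall, within experimental
error, on a straight line; the nature of this hypothesis will be discussed
further below, but now we proceed with the partition. The third sum
on the right of (3) may be written

$$m\sum_i\left[(\bar{x}_{i.} - \alpha' - \beta'\bar{z}_{i.}) - (\alpha_i - \alpha') - (\beta_i - \beta')\bar{z}_{i.}\right]^2 = m\sum_i(\bar{x}_{i.} - \delta_i - \alpha' - \beta'\bar{z}_{i.})^2 \tag{10}$$

where

$$\delta_i = (\alpha_i - \alpha') + (\beta_i - \beta')\bar{z}_{i.}$$

Regarding the $\bar{x}_{i.} - \delta_i$ as a new random variable, say $u_i$, the sum of
squares on the right of (10) may be formally partitioned just as was
done in equation (2.5) to get

$$m\sum_i(\bar{x}_{i.} - \delta_i - \alpha' - \beta'\bar{z}_{i.})^2 = m\sum_i(\bar{x}_{i.} - \delta_i - \hat{\alpha}'_\delta - \hat{\beta}'_\delta\bar{z}_{i.})^2 + m(\hat{\beta}'_\delta - \beta')^2\sum_i(\bar{z}_{i.} - \bar{z})^2 + mk(\bar{x} - \bar{\delta} - \alpha' - \beta'\bar{z})^2 \tag{11}$$

in which, referring to equations (13.2.8) and (13.2.9), we have

$$\hat{\alpha}'_\delta = \bar{x} - \bar{\delta} - \hat{\beta}'_\delta\bar{z} \tag{12}$$

$$\hat{\beta}'_\delta = \frac{\sum_i(\bar{x}_{i.} - \delta_i - \bar{x} + \bar{\delta})(\bar{z}_{i.} - \bar{z})}{\sum_i(\bar{z}_{i.} - \bar{z})^2} \tag{13}$$

and subscripts $\delta$ have been put on these two estimators to indicate
that they are functions of the unknown parameters $\delta_i$. Under the
null hypothesis that the $E(\bar{x}_{i.})$ are linear functions of the $\bar{z}_{i.}$ (i.e., that
the $\delta_i = 0$), these two estimators become

$$\hat{\alpha}' = \bar{x} - \hat{\beta}'\bar{z} \tag{14}$$

$$\hat{\beta}' = \frac{\sum_i(\bar{x}_{i.} - \bar{x})(\bar{z}_{i.} - \bar{z})}{\sum_i(\bar{z}_{i.} - \bar{z})^2} \tag{15}$$

the ordinary regression coefficients fitted to the points $(\bar{x}_{i.}, \bar{z}_{i.})$; they
are therefore called the regression coefficients for the cell means.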

The three terms on the right of (11) are independently distributed by
chi-square laws on division by $\sigma^2$, the first with $k - 2$ degrees of freedom
and the other two with one degree of freedom each. The null
hypothesis, $\delta_i = 0$, would be tested by putting $\delta_i = 0$ in the first term
on the right of (11) and comparing it with the first term on the right
of (3) in an F test. The nature of this null hypothesis is illustrated
on the left of Fig. 67, where the solid lines represent within-cell regressions
with equations $x = \hat{\alpha}_i + \hat{\beta}_iz$, and the dashed line represents the
regression of cell means $x = \hat{\alpha}' + \hat{\beta}'z$. The points on the solid lines
are $(\bar{x}_{i.}, \bar{z}_{i.})$, and the null hypothesis states that the expected values
of the vertical deviations of these points from the dashed line are
zero. Rejection of the null hypothesis is good evidence that the cell
parameters differ. However, as the right-hand graph shows, the cell
means can be linear, and even though the within-cell slopes are the
same, the $\alpha_i$ are different. That is, one can accept $\beta_i = \beta$ and $\delta_i = 0$,
yet it does not follow that the $\alpha_i$ are equal. However if $\beta' = \beta$, then
it would follow that the $\alpha_i$ were all equal.


Fig. 67.


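Assuming now that $\delta_i = 0$ and $\beta_i = \beta$ are acceptable hypotheses,
let us construct a test for the null hypothesis $\beta' = \beta$. The random
variables $\hat{\beta}$ of equation (8) and $\hat{\beta}'$ of equation (11) (putting $\delta_i = 0$) are
independently normally distributed with means $\beta$ and $\beta'$, and variances
$\sigma^2/\sum w_i$ and $\sigma^2/m\sum(\bar{z}_{i.} - \bar{z})^2$. Their difference is therefore distributed
normally with mean $\beta - \beta'$ and variance equal to the sum of the
individual variances. Hence

$$\frac{[\hat{\beta} - \hat{\beta}' - (\beta - \beta')]^2}{\sigma^2\left[\dfrac{1}{\sum_iw_i} + \dfrac{1}{m\sum_i(\bar{z}_{i.} - \bar{z})^2}\right]} = \frac{[\hat{\beta} - \hat{\beta}' - (\beta - \beta')]^2\,m\sum_iw_i\sum_i(\bar{z}_{i.} - \bar{z})^2}{\sigma^2\sum_{i,j}(z_{ij} - \bar{z})^2} \tag{16}$$

has the chi-square distribution with one degree of freedom. The
weighted sum of $\hat{\beta}$ and $\hat{\beta}'$,

$$\hat{\beta}\sum_iw_i + m\hat{\beta}'\sum_i(\bar{z}_{i.} - \bar{z})^2 \tag{17}$$

is normally distributed independently of $\hat{\beta} - \hat{\beta}'$ [it is necessary only
to show that the covariance between (17) and $\hat{\beta} - \hat{\beta}'$ is zero] with
mean $\beta\sum w_i + m\beta'\sum(\bar{z}_{i.} - \bar{z})^2$ and variance $\sigma^2\sum(z_{ij} - \bar{z})^2$. Thus

$$\frac{\left[(\hat{\beta} - \beta)\sum_iw_i + m(\hat{\beta}' - \beta')\sum_i(\bar{z}_{i.} - \bar{z})^2\right]^2}{\sigma^2\sum_{i,j}(z_{ij} - \bar{z})^2} \tag{18}$$

has the chi-square distribution with one degree of freedom and is
independent of (16). If the hypothesis $\beta = \beta'$ is accepted, then (18)
provides a test of whether their common value is zero. The two
independent degrees of freedom corresponding to $\hat{\beta}$ and $\hat{\beta}'$ in (8) and
(11) have been transformed to two other independent degrees of
freedom (16) and (18).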

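The complete partition of the sum of squares is exhibited in the
accompanying table, in which all parameters have been put equal to
zero. We shall review briefly the various tests:

Source | Sum of squares | Degrees of freedom
Deviations | $\sum_{i,j}(x_{ij} - \hat{\alpha}_i - \hat{\beta}_iz_{ij})^2$ | $k(m - 2)$
$\beta_i - \beta$ | $\sum_{i,j}(z_{ij} - \bar{z}_{i.})^2(\hat{\beta}_i - \hat{\beta})^2$ | $k - 1$
$\delta_i$ | $m\sum_i(\bar{x}_{i.} - \hat{\alpha}' - \hat{\beta}'\bar{z}_{i.})^2$ | $k - 2$
$\beta - \beta'$ | $\dfrac{m(\hat{\beta} - \hat{\beta}')^2\sum_iw_i\sum_i(\bar{z}_{i.} - \bar{z})^2}{\sum_{i,j}(z_{ij} - \bar{z})^2}$ | 1
$\beta = \beta'$ | $\dfrac{\left[\hat{\beta}\sum_iw_i + m\hat{\beta}'\sum_i(\bar{z}_{i.} - \bar{z})^2\right]^2}{\sum_{i,j}(z_{ij} - \bar{z})^2}$ | 1
Total | $\sum_{i,j}(x_{ij} - \bar{x})^2$ | $km - 1$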


1. $\beta_i - \beta = 0$. If the regression lines for the individual cells all
have the same slope, then the second mean square (sum of squares
divided by degrees of freedom) divided by the first mean square has
the F distribution with $k - 1$ and $k(m - 2)$ degrees of freedom. If
this hypothesis is rejected, then it is concluded at once that both
factors affect the subject of the experiment, for if the regression
coefficients differ, at least one of them must be different from zero.

2. $\delta_i = 0$. The third mean square divided by the first mean square
has the F distribution, whether or not the $\beta_i$ are equal, when the cell
means are linear.

3. $\beta - \beta' = 0$. The fourth mean square divided by the first has
the F distribution if $\beta = \beta'$, only if it is true that $\beta_i = \beta$ and $\delta_i = 0$.
One would not make this test if either of the first two null hypotheses
were rejected. If all three of these null hypotheses are accepted, then
it is inferred that the factor corresponding to the discrete classification
does not affect the subject of the experiment (the $\alpha_i$ as well as
the $\beta_i$ are the same for all cells).

4. $\beta = \beta' = 0$. This test would be made only if all three of the
other null hypotheses were accepted. If this fourth null hypothesis is
accepted also, then one infers that neither of the two factors affects
the subject of the experiment.

In many experiments there would be no thought of making all these 
tests; the primary object of the experiment might be to estimate the 
regression coefficients, it being well known in advance that both factors 
influence the subject. In such cases one would ordinarily make only 
the first test, in order to decide whether the same slope would suffice 
for all cells or whether a separate slope should be computed for each 
cell. 

14.13. Analysis of Adjusted Means. There is one other aspect of
the analysis of covariance that needs to be discussed. We may refer
to the illustration at the beginning of the previous section. Suppose
it is found that both factors affect the penetration; the $\alpha_i$ and $\beta_i$ are
different for the different plates, but this was to be expected anyway,
and these results are of minor interest. The real question may be, Do
the plates differ in their resistance for velocities $z = z_0$? (Thus $z_0$
may be the ordinary short-range velocity of 50-caliber bullets.)
Admitted that some plates may be particularly good for very high
velocities while others may be better for low velocities, how do they
rank at the velocity of real interest?

Using the notation of Sec. 12, the cell means $\bar{x}_{i.}$ correspond to velocities
$\bar{z}_{i.}$; in fact

$$\bar{x}_{i.} = \hat{\alpha}_i + \hat{\beta}_i\bar{z}_{i.} \tag{1}$$

With these regression coefficients we estimate that the cell means
would have been

$$y_i = \hat{\alpha}_i + \hat{\beta}_iz_0 = \bar{x}_{i.} - \hat{\beta}_i(\bar{z}_{i.} - z_0) \tag{2}$$

if all the $\bar{z}_{i.}$ had been equal to $z_0$. The $y_i$ are called adjusted cell means,
and we are interested in testing the null hypothesis that the expected
values of the adjusted means are the same for all cells. The $y_i$ are
independently normally distributed, as follows from equation (12.2),
with variances

$$\sigma_i^2 = \sigma^2\left[\frac{1}{m} + \frac{(\bar{z}_{i.} - z_0)^2}{w_i}\right] \tag{3}$$

and with means which may be denoted by $\eta_i$. Since the variances of
the $y_i$ are different, we test the null hypothesis $\eta_i = \eta$ by using the
weighted sum of squares of deviations from their weighted mean, say,

$$\bar{y} = \frac{\sum_i(y_i/\sigma_i^2)}{\sum_i(1/\sigma_i^2)} \tag{4}$$

just as was done in the preceding section in testing the $\beta_i$. The sum of
squares is

$$\sum_i\frac{1}{\sigma_i^2}(y_i - \bar{y})^2 = \frac{1}{\sigma^2}\sum_i\frac{mw_i(y_i - \bar{y})^2}{w_i + m(\bar{z}_{i.} - z_0)^2} \tag{5}$$

which has the chi-square distribution with $k - 1$ degrees of freedom
when the $\eta_i$ are equal and which is distributed independently of the
first sum on the right of (12.2). Thus we have an F test for $\eta_i = \eta$.
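A numerical sketch of this test, with a separate slope fitted in each cell, might look as follows in Python. It is illustrative only: NumPy is assumed, the function name `adjusted_means_test` is hypothetical, and the data are again assumed to be $k \times m$ arrays with one row per cell. The unknown $\sigma^2$ cancels from the weighted mean and from the F ratio, so the weights are taken proportional to $1/\sigma_i^2$.

```python
import numpy as np

def adjusted_means_test(x, z, z0):
    """F ratio for equality of the cell means adjusted to the common value z0.

    x, z: arrays of shape (k, m); z0: value of z to which the means are adjusted.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    k, m = x.shape
    x_bar = x.mean(axis=1)
    z_bar = z.mean(axis=1)
    xc = x - x_bar[:, None]
    zc = z - z_bar[:, None]

    w = np.sum(zc ** 2, axis=1)                     # weights w_i of Sec. 12
    b = np.sum(zc * xc, axis=1) / w                 # within-cell slopes
    y = x_bar - b * (z_bar - z0)                    # adjusted means, equation (2)

    # sigma^2 / sigma_i^2, from equation (3).
    weight = m * w / (w + m * (z_bar - z0) ** 2)
    y_bar = np.sum(weight * y) / np.sum(weight)     # weighted mean, equation (4)
    ss_adjusted = np.sum(weight * (y - y_bar) ** 2) # equation (5) times sigma^2

    ss_dev = np.sum((xc - b[:, None] * zc) ** 2)    # within-cell deviations
    f_ratio = (ss_adjusted / (k - 1)) / (ss_dev / (k * (m - 2)))
    return {"adjusted_means": y, "ss_adjusted": ss_adjusted,
            "F": f_ratio, "df": (k - 1, k * (m - 2))}
```

When every cell mean $\bar{z}_{i.}$ already equals $z_0$, no adjustment takes place, every weight reduces to $m$, and the numerator becomes the ordinary between-cell sum of squares.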

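If the first null hypothesis, $\beta_i = \beta$, of the preceding section is
accepted, the $\bar{x}_{i.}$ are adjusted by the single regression coefficient $\hat{\beta}$,
and the adjusted means are

$$y_i = \bar{x}_{i.} - \hat{\beta}(\bar{z}_{i.} - z_0) \tag{6}$$

The variances of the $y_i$ then become

$$\sigma_i^2 = \sigma^2\left[\frac{1}{m} + \frac{(\bar{z}_{i.} - z_0)^2}{\sum_iw_i}\right] \tag{7}$$

and equations (4) and (5) are altered accordingly. In this case the
sum of squares for the denominator of the F test is often taken to be
the sum of the first two sums in the table of the preceding section.
Thus the deviation sum of squares would be

$$\sum_{i,j}(x_{ij} - \hat{\alpha}_i - \hat{\beta}_iz_{ij})^2 + \sum_iw_i(\hat{\beta}_i - \hat{\beta})^2 = \sum_{i,j}(x_{ij} - \tilde{\alpha}_i - \hat{\beta}z_{ij})^2 \tag{8}$$

with $km - k - 1$ degrees of freedom, where $\tilde{\alpha}_i = \bar{x}_{i.} - \hat{\beta}\bar{z}_{i.}$ is the
intercept for the $i$th cell when the common slope $\hat{\beta}$ is used.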


In testing adjusted means, one would ordinarily choose $z_0 = \bar{z}$
unless there was good reason for not doing so.

14.14. Notes and References. The general field of experimental 
design was first thoroughly explored by Fisher, whose book [1] remains 
today the most important treatment of the subject. It was originally 
published in 1935. Yates [2] has introduced many valuable new 
designs. The tables of Fisher and Yates [3] describe most of the known 
designs and give instructions for using them. 

The analysis-of-variance technique is also due to Fisher. Fisher
used the test criterion ½ log F rather than F in his development.
The latter version of the criterion is due to Snedecor, who named it F
after Fisher. An excellent presentation of the practical aspects of 
experimental design and analysis of variance may be found in Snede- 
cor’s book [4], a large part of which is devoted to these subjects. 

We have given in this chapter merely the barest introduction to the 
subject. Only the simplest designs have been considered, and they 
have not been fully analyzed. The total sum of squares may be
further partitioned to study individual effects of factors and to study 
the linear, quadratic, cubic (and so forth) components of factors whose 
levels are chosen values of a continuous variate. Also the analysis was 
much simplified by assuming equal numbers of observations in the 
cells. When the cell frequencies are not equal, the analysis becomes 
much more tedious (except in the case of one-way classifications), 
primarily because the tests become nonorthogonal so that simple suc- 
cessive partition of the total sum of squares is no longer possible. The 
analysis of covariance can become quite difficult for more elaborate 
designs and more complicated regression functions; we have dealt only
with the simplest case. 

Most experimental work today is based on the rule: “Keep all 
variables constant but one,” an ancient and erroneous dictum which 
guarantees a high degree of inefficiency. One well-designed experi- 
ment, taking account of all relevant factors, is worth dozens or even 
hundreds of experiments which study one factor at a time keeping the 
others constant. 


1. R. A. Fisher: “Design of Experiments,” 4th ed., Oliver & Boyd,
Ltd., Edinburgh and London, 1945.
2. F. Yates: “Design and Analysis of Factorial Experiments,”
Imperial Bureau of Soil Science, Harpenden, 1937.
3. R. A. Fisher and F. Yates: “Statistical Tables,” 3d ed., Hafner
Publishing Co., Inc., New York, 1948.
4. G. W. Snedecor: “Statistical Methods,” 4th ed., Iowa State College
Press, Ames, 1947.


14.15. Problems 

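1. Test for differences between machines using the data of Sec. 3.
The computations are usually easier to do if the sums of squares are
put in forms which do not employ deviations from means. Thus,
when the $n_i$ are equal, say $n_i = m$,

$$\sum_in_i(\bar{x}_{i.} - \bar{x})^2 = \frac{1}{m}\sum_iX_{i.}^2 - \frac{1}{n}X^2 \quad\text{and}\quad \sum_{i,j}(x_{ij} - \bar{x})^2 = \sum_{i,j}x_{ij}^2 - \frac{1}{n}X^2$$

where $X = \sum_{i,j}x_{ij}$, $X_{i.} = \sum_jx_{ij}$, and $n = \sum_in_i$.

2. Use the data of Sec. 5 to test whether machine effects differ.
Note that

$$r_2\sum_i(\bar{x}_{i.} - \bar{x})^2 = \frac{1}{r_2}\sum_iX_{i.}^2 - \frac{1}{r_1r_2}X^2 \qquad r_1\sum_j(\bar{x}_{.j} - \bar{x})^2 = \frac{1}{r_1}\sum_jX_{.j}^2 - \frac{1}{r_1r_2}X^2$$

and that the deviation sum of squares may be obtained by subtracting
these two sums from $\sum x_{ij}^2 - (1/r_1r_2)X^2$.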

3. Referring to Prob. 2, find a 95 per cent confidence interval for 
the difference between the effects of the first and third machines. 

4. Four varieties of oats were compared on a block of land by 
dividing the block into 16 plots and using a 4 × 4 Latin square (chosen
at random) in order to take account of possible fertility gradients in 
the soil. The resulting yields in pounds were found to be as follows, 
where the integers 1, 2, 3, 4 refer to varieties. Test for differences 
between variety effects. Was it worth while to use the Latin square? 



5. Analyze the following data taken from a much larger table: 


RETAIL PRICES OF BREAD

Type of store | New York | Chicago | Los Angeles
Chain stores and supermarkets | 14, 15.5, 15, 13 | 14, 13, 11.5, 13 | 15, 15, 14, 13.5
Supermarkets (not chain) | 14.5, 13, 12.5, 13 | 13, 13, 12, 13 | 13, 15, 14, 13.5
Neighborhood stores | 18, 15, 15, 17 | 15, 15, 16, 15 | 16, 20, 15, 18


6. Analyze the following data: 


AVERAGE NUMBER OF CHILDREN PER FAMILY

Family income | Cities: White | Cities: Negro | Towns: White | Towns: Negro | Rural areas: White | Rural areas: Negro
Under $4,000
Over $4,000


yw 
HN 
to 02 
ao 


2 3.2 
8 2.9 


7. A paint-manufacturing company tests new formulas for outside 
paint by painting 12 panels of each of three kinds of wood (36 panels 
in all) and exposing them for 2 years in four climates (warm dry, cold 
dry, warm humid, cold humid), putting three panels for each type of 
wood in each climate. A group of paint technologists then score the 
panels on a scale from 0 to 100. Analyze the following data for four 
formulas: 


Type of wood | Formula | Climate 1 | Climate 2 | Climate 3 | Climate 4
1 | 1 | 21, 15, 17 | 56, 59, 53 | 41, 38, 42 | 51, 47, 43
1 | 2 | 20, 18, 19 | 61, 62, 62 | 46, 47, 45 | 55, 51, 54
1 | 3 | 26, 30, 31 | 72, 67, 70 | 50, 48, 54 | 64, 63, 66
1 | 4 | 31, 34, 32 | 66, 64, 67 | 54, 52, 55 | 64, 65, 64
2 | 1 | 24, 20, 23 | 54, 54, 56 | 39, 38, 39 | 50, 49, 50
2 | 2 | 21, 25, 25 | 58, 64, 61 | 45, 44, 45 | 54, 53, 52
2 | 3 | 30, 31, 31 | [illegible] | 49, 48, 53 | 59, 61, 60
2 | 4 | 33, 34, 30 | 74, 71, 72 | 48, 56, 53 | 59, 62, 62
3 | 1 | 14, 17, 18 | 56, 55, 52 | 42, 40, 40 | 48, 49, 47
3 | 2 | 21, 23, 22 | 61, 60, 58 | 46, 48, 50 | 53, 54, 55
3 | 3 | 30, 30, 32 | 69, 71, 70 | 50, 47, 48 | 59, 62, 63
3 | 4 | 36, 33, 35 | 68, 73, 77 | 55, 54, 51 | 62, 66, 64


8. A nutrition experiment studied the effects of five diets for 
fattening pigs for the market. Fifteen pigs, three for each diet, were 
put on the diets for 1 month. The following table gives the final and 
initial weights in pounds. Analyze the results. 


Diet 1 | Diet 2 | Diet 3 | Diet 4 | Diet 5
118, 72 | 102, 70 | 91, 63 | 104, 65 | 93, 68
108, 64 | 83, 55 | 97, 64 | 110, 60 | 79, 65
109, 63 | 99, 01 | 92, 62 | 95, 57 | 96, 69


9. A first-grade teacher with 20 pupils decided to test for herself 
the merits of two methods of teaching reading. The class was divided 
into two groups of ten and the pupils given an intelligence test (I). 
At the end of the year they were given a comprehensive achievement 
test (A) in reading. Compare the two methods. 


Method 1 


Method 2 


10. Manufacturers of mass-production items often use statistical
methods to control variations in the quality of their product. One
technique is to take periodic samples of items from the production
line and measure some critical dimension or other property (hardness,
breaking strength, electrical resistance, etc.). Thus one might examine
samples of size five every half hour over two 8-hour shifts, obtaining
32 samples in all. How would you use these data to test homogeneity
of the production process over time, and what assumptions do you
require? The null hypothesis is that no factors have crept in to alter
the process—factors such as variations in incoming raw material; 
slipping of machine adjustments; failure of governors, thermostatic 
controls, etc.; differences in techniques of assembly-line workers; wear 
and tear on the equipment; and the like. If the null hypothesis is 
acceptable the process is said to be in control. 

11. Samples of three fuses were taken every hour for 2 days from a
process making 10-ampere fuses. The fuses were blown and the current
measured with the following results. Is the process in control?

Sample | Measurements | Sample | Measurements
1 | 10.2, 10.1, 10.3 | 9 | 10.0, 9.8, 9.8
2 | 9.7, 9.9, 10.4 | 10 | 9.8, 9.7, 10.0
3 | 10.0, 10.1, 9.9 | 11 | 10.1, 10.1, 10.1
4 | 10.1, 9.8, 10.3 | 12 | 10.3, 10.2, 10.3
5 | 9.8, 10.0, 10.2 | 13 | 10.0, 10.2, 10.0
6 | 10.2, 10.1, 10.0 | 14 | 10.0, 10.1, 10.2
7 | 9.5, 10.1, 9.7 | 15 | 10.1, 10.4, 10.1
8 | 9.9, 9.9, 9.7 | 16 | 10.5, 10.2, 10.4


12. Referring to Prob. 11, let $\bar{x}$ be the mean of all observations and
let $s$ be the estimate of the standard deviation based on the within-sample
deviations. Suppose now that another sample is drawn with
measurements $y_1$, $y_2$, $y_3$. How would you test (assuming normality
and common variances) the null hypothesis that $E(\bar{y}) = E(\bar{x})$?

13. In quality-control work, after a collection of samples has been
analyzed, a control chart is constructed. The chart is simply a set
of three horizontal lines drawn on graph paper at $\bar{x}$, $\bar{x} + 3s/\sqrt{m}$,
$\bar{x} - 3s/\sqrt{m}$ on the vertical scale. Here $s$ is the within-sample estimate
of the standard deviation, and $m$ is the sample size. The central
line is called the process average, and the other two lines are called
control limits. One continues to sample the process periodically and
plots the successive sample means as points on the chart (the abscissa
of the $i$th sample mean is $i$). When a point falls outside the control
limits, the production process is halted and carefully examined for 
presence of disturbing factors. About how many times per thousand 
samples will the process be futilely examined if the process remains 
in control? 

14. In the above problem, the plotting of each point constitutes a
simplified test of the null hypothesis described in Prob. 12. Criticize
this test. Under what circumstances would you regard the lack of 
independence between successive tests as not serious? 

15. Verify equation (5.7) of the text. 

16. Show that the expressions (5.21), (5.22), and (5.23) reduce to 
terms of (5.7). 

17. Work through the details of the derivation of the analysis-of- 
variance table of Sec. 7. 

18. Verify equation (8.2). 

19. Referring to the components-of-variance model of Sec. 9, suppose
one wished merely to estimate the variance components $\sigma^2$, $\sigma_a^2$, $\sigma_b^2$
and had no intention of testing hypotheses about them. Would it 
be necessary to assume normality? Would the obvious estimates 
determined from the analysis-of-variance table (by equating mean 
squares to expected values) necessarily be good estimates? 



20. What are the maximum-likelihood estimators of $\sigma^2$, $\sigma_a^2$, $\sigma_b^2$ of
Sec. 9?

21. Show that the four sums of squares in the first analysis-of-vari- 
ance table of Sec. 10 are independently distributed by chi-square 
laws. 

22. Derive the expected mean squares in the first analysis-of- 
variance table of Sec. 10. 

23. Verify equations (10.3) and (10.4). 

24. Derive the expected mean squares for the table of Sec. 11. 

25. Show that (12.18) has the chi-square distribution and is inde- 
pendent of (12.16). 

26. Verify equation (13.8). 

27. Verify the total in the analysis-of-covariance table of Sec. 12. 

28. In a two-factor experiment with each factor at two levels, it was 
possible to obtain only one observation for three of the cells and two 
for the fourth. Test for significance of the interaction. 


$A_1$ | 68 | 54
$A_2$ | 50 | 55.1, 54.9


29. Show that the analysis-of-covariance table would have been as 
follows had the cell frequencies been different, say m;: 


D f 
Source Sum of squares n 
Deviations by (£i; — ài — Bie? Dm; — 2k 
ij 
Bi — B Y, Ga — 2*8 — 8)? k-1 
D) 
ôi » mi. — áo — Bui)? k-2 
i 
(8— Ay, Wi D) mig. — 2)? 
5 i 1 
ru D(z; — 3) | 
BXw; + Emi; — 2)? 
BE CP tArm d 1 
(ei; — 2) 
"Total D (T — 2) am: — 1 
ij 




30. Express all the $\hat{\alpha}$'s and $\hat{\beta}$'s of the preceding table in terms of the
$x_{ij}$ and $z_{ij}$.

31. Test whether the regression function is of the form $\alpha + \beta z + \gamma z^2$
given the following observations (x, z) on a random variate x and an
observable parameter z: (2.1, 0), (6, -1), (6, 4), (1.9, 0), (0, 2), (6.1, 4),
(0.1, 1). Do not work through the arithmetic; merely specify all the
steps in detail.

32. Using the data of Prob. 31, test whether the regression function
is of the form $2 - 3z + z^2$.

33. Discuss the problem of testing whether the means of two 
samples from normal populations with the same variance are equal. 
Use the analysis of variance for one factor at two levels, and com- 
pare the resulting test with the one given in Sec. 12.7. 

34. Consider a one-way classification with observations $x_{ij}$
($i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n_i$), there being unequal
subclass numbers $n_i$. Show that the analysis-of-variance table for
the components-of-variance model is:

Source     | Sum of squares                          | Degrees of freedom | Expected mean square
Effects    | $\sum_i n_i(\bar{x}_{i.} - \bar{x})^2$  | $k - 1$            | $\sigma^2 + n_0\sigma_a^2$
Deviations | $\sum_{i,j} (x_{ij} - \bar{x}_{i.})^2$  | $N - k$            | $\sigma^2$
Total      | $\sum_{i,j} (x_{ij} - \bar{x})^2$       | $N - 1$            |

where $N = \sum_i n_i$, $\sigma^2$ is the error variance, $\sigma_a^2$ is the effect
variance, and

$$n_0 = \frac{1}{k-1}\left(N - \frac{\sum_i n_i^2}{N}\right)$$

Observe also that $n_0$ reduces to m if all $n_i = m$.




CHAPTER 15 
SEQUENTIAL TESTS OF HYPOTHESES 


15.1. Sequential Analysis. Sequential analysis refers to techniques 
for testing hypotheses or estimating parameters when the sample size 
is not fixed in advance but is determined during the course of the 
experiment by criteria which depend on the observations as they occur. 

In Sec. 12.2 we considered the test of a null hypothesis against a
single alternative. It was shown that for samples of size n,
$(x_1, x_2, \ldots, x_n)$, the test which minimizes the Type II error for
fixed Type I error is the likelihood-ratio test.

[Fig. 68. Density curves $f_0(x)$ and $f_1(x)$.]

Thus if the Type I error is chosen to be $\alpha$, then $\alpha$ determines a
number $A_n$ by virtue of the equation

$$\alpha = \int \cdots \int_{\lambda_n > A_n} \prod_{i=1}^{n} f_0(x_i)\, dx_1 \cdots dx_n \tag{1}$$

where

$$\lambda_n = \prod_{i=1}^{n} \frac{f_1(x_i)}{f_0(x_i)} \tag{2}$$

and the critical region for rejection of $H_0$ is the region

$$\lambda_n > A_n \tag{3}$$

This critical region minimizes the probability $\beta$ (Type II error) of
accepting $H_0$ when $H_1$ is true.
Suppose it is desired to fix both $\alpha$ and $\beta$ in advance. One could do
so as follows if the sample size were at his disposal: first determine
$A_n$ as a function of n by means of (1), then determine $\beta$ as a function
of n,

$$\beta_n = \int \cdots \int_{\lambda_n < A_n} \prod_{i=1}^{n} f_1(x_i)\, dx_1 \cdots dx_n \tag{4}$$

and finally select n so that $\beta_n$ has the desired value.

Suppose further that for, say, $\alpha = .01$ and $\beta = .01$, and for
particular functions $f_0(x)$ and $f_1(x)$, we had worked through the
computation and found n to be 100. The following considerations make
sequential analysis interesting both from the theoretical and practical
viewpoint:
In drawing the 100 observations to test $H_0$, it is possible that among
the first few observations there may be one or more so far to the left
that eventual rejection of $H_0$ is out of the question and it would be a
waste of time to make the remaining observations. In other instances
the first 20 or first 30 or first 40 observations may provide quite
sufficient evidence, relative to $\alpha$ and $\beta$, for accepting or rejecting $H_0$.
In short, the possibility is raised that, by constructing the test in a
fashion which permits termination of the sampling at any observation,
one can test $H_0$ with fixed errors $\alpha$ and $\beta$ and yet do so with fewer than
100 observations on the average. This is in fact the case, though it
may at first appear surprising in view of the fact that the best test for
fixed sample size does require 100 observations. The saving in
observations is often quite large, sometimes more than 50 per cent. That
is, in repeated tests of $H_0$ against $H_1$ for fixed control of both errors,
100 observations per test may be required for fixed sample sizes, but
for sequential sampling and the same control of the errors, only 50
observations per test may be required on the average.

15.2. Construction of Sequential Tests. The theory of sequential
testing has been developed only for the case of testing a null
hypothesis $H_0$ against a single alternative $H_1$. It will become apparent
in the later sections of the chapter that this restriction is not serious
in application of the methods to practical problems. We shall let $H_0$
refer to a density function $f_0(x)$ and $H_1$ to $f_1(x)$. Observations will
be denoted by $x_1, x_2, \ldots$, where the subscripts give the order in
which the observations are taken.

The sequential test employs the likelihood ratio

$$\lambda_m = \prod_{i=1}^{m} \frac{f_1(x_i)}{f_0(x_i)} \tag{1}$$

and two positive numbers A and B, with A > 1 and B < 1. As
observations are made, one computes the ratios $\lambda_1, \lambda_2, \lambda_3, \ldots$, and
continues taking observations as long as

$$B < \lambda_m < A \tag{2}$$

If, for some m, $\lambda_m$ is less than or equal to B, $H_0$ is accepted and the
test is completed. If $\lambda_m$ becomes greater than or equal to A at some
stage, $H_0$ is rejected and the test is completed. The procedure then
is to continue sampling until $\lambda_m$ falls outside the interval specified by
(2), at which time the sampling ceases.

The first question that naturally arises is, What is to prevent the
sampling from going on forever? It is easy to show that this cannot
happen; the probability is one that the process will terminate
whatever the distribution of x. Let

$$z = \log \frac{f_1(x)}{f_0(x)} \tag{3}$$

then z will have some density function, say g(z), determined by the
density function of x [which need not be $f_0(x)$ or $f_1(x)$]. The sequence
of observations $x_1, x_2, \ldots$ determines a sequence of z observations
$z_1, z_2, \ldots$. The sequence of inequalities (2) becomes

$$\log B < \sum_{i=1}^{m} z_i < \log A \tag{4}$$


where log B is negative and log A is positive. Let $c = \log A - \log B$
and let p be the area under g(z) between -c and c. Now if any one
of the $z_i$ falls outside the interval -c to c, one of the inequalities in (4)
will necessarily be violated either at that stage or, if not then, at some
previous stage.

[Figure: the density $g(z)$ plotted against z, with the points -c and c marked.]

Hence if (4) is to hold for all m, at the very least
every $z_i$ must fall between -c and c. (Of course the inequalities may
be violated though all the z's do fall in that interval.) The probability
that every $z_i$ falls in the interval is $p^m$ for the first m observations
(since they are independent), and this probability approaches zero
as m increases, since p is less than one. Thus (4) cannot remain true
indefinitely. [In case g(z) is zero outside -c to c, one would define
new variables $y_i$, letting $y_1$ be the sum of the first r z's, $y_2$ the sum of
the next r z's, and so forth, taking r to be large enough that the
nonzero range of the density function of y does not fall within -c to c.]

We turn now to the determination of A and B. The probability
$\alpha$ that $H_0$ will be rejected when it is true is found by computing the
probability that $\lambda_m$ will exceed A before it becomes less than B. It is
clear that

$$\alpha = P(\lambda_1 \ge A) + P(B < \lambda_1 < A,\ \lambda_2 \ge A) + P(B < \lambda_1 < A,\ B < \lambda_2 < A,\ \lambda_3 \ge A) + \cdots \tag{5}$$

Similarly the probability $\beta$ that $H_0$ will be accepted when $H_1$ is true is

$$\beta = P(\lambda_1 \le B) + P(B < \lambda_1 < A,\ \lambda_2 \le B) + P(B < \lambda_1 < A,\ B < \lambda_2 < A,\ \lambda_3 \le B) + \cdots \tag{6}$$

For two specified density functions $f_0(x)$ and $f_1(x)$ one could compute
all these probabilities, using $f_0(x)$ in (5) and $f_1(x)$ in (6). It follows
then that $\alpha$ and $\beta$ are known functions of A and B; hence if $\alpha$ and $\beta$ are
specified in advance, A and B are determined by (5) and (6).

As might be anticipated, the actual determination of A and B from
(5) and (6) can be a major computational project. In practice, they
are never determined that way because a very simple and accurate
approximation is available. The approximate formulas are

$$A \cong \frac{1 - \beta}{\alpha} \tag{7}$$

$$B \cong \frac{\beta}{1 - \alpha} \tag{8}$$


and they arise from the following considerations. Suppose $\lambda_m$ were a
continuous function of a continuous variate m so that $\lambda_m$ could be
plotted as a curve against m, and suppose the test were performed by
moving out along the m axis until $\lambda_m$ first equaled A or B. That is,
the test is continued as long as (2) is true and ceases when either
$\lambda_m = B$ ($H_0$ accepted) or $\lambda_m = A$ ($H_1$ accepted). At all points of the
$(x_1, x_2, \ldots)$ space where $H_0$ is accepted, the likelihood of $H_1$, say $L_1$, is
exactly B times the likelihood $L_0$ of $H_0$, since $\lambda = L_1/L_0 = B$ at those
points. Hence the integral of $L_1$ over those points is exactly equal to
B times the integral of $L_0$ over those points. But the first integral is $\beta$,
and the second is $1 - \alpha$ (the probability of accepting $H_0$ when it is
true). So we would have $\beta$ exactly equal to $B(1 - \alpha)$ if continuous
sampling were possible, and (8) would hold exactly. By a similar
argument at $\lambda_m = A$, (7) would be an exact equality if m were a
continuous variate. Since the error of using (7) and (8) is merely a
consequence of the discreteness of m, one would expect it to be small,
and analytical investigation shows that it is quite small when both
$\alpha$ and $\beta$ are less than one-half. We shall not, however, look into this
matter.

Equations (7) and (8) make the actual performance of a sequential
test astonishingly simple. It is not necessary to develop any sampling
distribution theory at all; one merely selects $\alpha$ and $\beta$ arbitrarily,
computes A and B, and proceeds at once with the test.
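
The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `wald_limits` and `sequential_test` are ours, the densities are supplied as log-density functions (so that the ratio is accumulated on the log scale), and the limits are the approximate ones of (7) and (8).

```python
import math
from typing import Callable, Iterable, Tuple


def wald_limits(alpha: float, beta: float) -> Tuple[float, float]:
    """Approximate limits of (7) and (8): A = (1 - beta)/alpha, B = beta/(1 - alpha)."""
    return (1.0 - beta) / alpha, beta / (1.0 - alpha)


def sequential_test(
    observations: Iterable[float],
    log_f0: Callable[[float], float],
    log_f1: Callable[[float], float],
    alpha: float,
    beta: float,
) -> Tuple[str, int]:
    """Run the sequential likelihood-ratio test on a stream of observations.

    Returns the decision and the number m of observations used.
    Hypothetical helper for illustration; not taken from the text.
    """
    A, B = wald_limits(alpha, beta)
    log_a, log_b = math.log(A), math.log(B)
    log_lambda = 0.0  # log of the likelihood ratio lambda_m
    m = 0
    for x in observations:
        m += 1
        log_lambda += log_f1(x) - log_f0(x)
        if log_lambda <= log_b:
            return "accept H0", m
        if log_lambda >= log_a:
            return "reject H0", m
    return "undecided", m  # the data ran out before a limit was reached


# Example: normal densities with unit variance, mean 0 under H0 and 1 under H1.
# The normalizing constants cancel in the ratio, so they are omitted.
def log_f0(x: float) -> float:
    return -0.5 * x * x


def log_f1(x: float) -> float:
    return -0.5 * (x - 1.0) ** 2
```

Working with log lambda rather than lambda itself avoids overflow in long runs and makes each step a simple addition, which is the form (4) of the inequalities used in the termination argument.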

15.3. Power Functions. Let a density function $f(x; \theta)$ have one
parameter $\theta$ and let us test the null hypothesis, $\theta = \theta_0$, against the
alternative hypothesis, $\theta = \theta_1$. We are interested in the behavior
of the test for all possible values of $\theta$. In particular, we shall examine
the power function of the test, $P(\theta)$, which is the probability that $\theta_0$
will be rejected when $\theta$ is the true parameter value. Of course

$$P(\theta_0) = \alpha \tag{1}$$

$$P(\theta_1) = 1 - \beta \tag{2}$$

and (supposing for definiteness that $\theta_0 < \theta_1$) we should expect the
power function to have somewhat the shape of the curve of Fig. 69.

[Fig. 69. Sketch of the power function $P(\theta)$ plotted against $\theta$.]


The straightforward way to compute $P(\theta)$ is simply to add the
probabilities that $H_0$ will be rejected at each observation. Thus

$$P(\theta) = P(\lambda_1 \ge A) + P(B < \lambda_1 < A,\ \lambda_2 \ge A) + P(B < \lambda_1 < A,\ B < \lambda_2 < A,\ \lambda_3 \ge A) + \cdots \tag{3}$$

where, for example,

$$P(B < \lambda_1 < A,\ \lambda_2 \ge A) = \iint_R f(x_1; \theta) f(x_2; \theta)\, dx_1\, dx_2 \tag{4}$$

and the double integral is taken over the region R in the $x_1, x_2$ plane
defined by the inequalities

$$B < \frac{f(x_1; \theta_1)}{f(x_1; \theta_0)} < A, \qquad \frac{f(x_1; \theta_1) f(x_2; \theta_1)}{f(x_1; \theta_0) f(x_2; \theta_0)} \ge A \tag{5}$$

This procedure for determining the power function is tedious to say
the least and is usually so troublesome as to be completely out of the
question in practice.

To avoid the use of (3), a very ingenious device has been developed.
We shall present it without a formal proof of its correctness, merely
giving the general pattern of the proof. The argument requires first
the existence of a nonzero number h such that

$$g(x; \theta) = \left[\frac{f(x; \theta_1)}{f(x; \theta_0)}\right]^h f(x; \theta) \tag{6}$$

is a density function; i.e., a number h such that

$$\int_{-\infty}^{\infty} g(x; \theta)\, dx = 1 \tag{7}$$

Of course h = 0 will make $g(x; \theta)$ a density function, because $f(x; \theta)$
is a density function. To show that such a nonzero value of h exists,
we consider the expected value of $[f(x; \theta_1)/f(x; \theta_0)]^u$ as a function of u,
say $\phi(u)$,

$$\phi(u) = \int_{-\infty}^{\infty} \left[\frac{f(x; \theta_1)}{f(x; \theta_0)}\right]^u f(x; \theta)\, dx \tag{8}$$

Obviously $\phi(u)$ is always positive, and furthermore $\phi(0) = 1$. We
can also argue that $\phi(u)$ becomes infinite when u approaches infinity
in either the positive or negative direction. Since $f(x; \theta_1)$ and $f(x; \theta_0)$
differ, there will be an interval or set of intervals where their ratio is
greater than one. Over such intervals the integrand becomes large
with increasing u, and $\phi(u) \to \infty$ as $u \to \infty$. Similarly there will be
intervals where the inverse ratio is greater than one and the integrand
becomes large for large negative values of u. This is enough to show
the existence of h. (Of course, $\phi(u)$ may have a minimum at u = 0,
in which case h would not exist, but this can happen only for particular
values of $\theta$, not in general.) So far as our argument goes, there may
be several values of u for which $\phi(u) = 1$. Actually there is only one,
for the shape of $\phi(u)$ is as illustrated in Fig. 70; the minimum, though,
may be to the left of the origin so that h may be negative. Thus there
exists a nonzero h in general such that $\phi(h) = 1$, and (6) is therefore a
density function.

[Fig. 70. Sketch of $\phi(u)$ plotted against u.]


One now sets up a sequential test of the null hypothesis $H_0'$ that the
density function is $f(x; \theta)$ against the alternative hypothesis $H_1'$ that
the density function is $g(x; \theta)$. Of course the null hypothesis here is
true by assumption. The limits for the likelihood ratio are taken to
be $A^h$ and $B^h$. Thus the test continues as long as

$$B^h < \frac{g(x_1; \theta)\, g(x_2; \theta) \cdots g(x_m; \theta)}{f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_m; \theta)} < A^h \tag{9}$$

and ceases when the ratio equals or falls outside these limits. We are
assuming here that h is positive; if it is negative, A and B are
interchanged. In view of (6) the test defined by (9) is exactly equivalent
to the original sequential test under consideration; i.e., (9) is equivalent
to

$$B < \frac{f(x_1; \theta_1)\, f(x_2; \theta_1) \cdots f(x_m; \theta_1)}{f(x_1; \theta_0)\, f(x_2; \theta_0) \cdots f(x_m; \theta_0)} < A \tag{10}$$

Thus the rejection of $H_0$ implies the rejection of $H_0'$. But we can
compute at once the probability that $H_0'$ will be rejected when $H_0'$ is
true [$f(x; \theta)$ is the true density function]; hence we have $P(\theta)$ for
$f(x; \theta)$. $H_0'$ will be rejected when it is true with probability $\alpha'$ and
accepted when $H_1'$ is true with probability $\beta'$ where, in accordance with
(2.7) and (2.8),

$$A^h \cong \frac{1 - \beta'}{\alpha'} \tag{11}$$

$$B^h \cong \frac{\beta'}{1 - \alpha'} \tag{12}$$

On solving this pair of equations for $\alpha'$, we find

$$\alpha' = P(\theta) \cong \frac{1 - B^h}{A^h - B^h} \tag{13}$$

Thus to find the ordinate of the power function at a point $\theta$, one first
finds the function $\phi(u)$ defined by (8) for that value of $\theta$; then puts
$\phi(u) = 1$ and solves for u; the nonzero root is the number h of (13),
which then determines $P(\theta)$.

As an illustration, let us consider the null hypothesis that the mean
of a normal distribution is $\mu_0$ against the alternative that the mean is
$\mu_1$ (with $\mu_0 < \mu_1$), assuming that the variance $\sigma^2$ is known. We wish
to find the probability $P(\mu)$ that $\mu_0$ will be rejected when the true mean
is $\mu$. The function $\phi(u)$ is

$$\phi(u) = \int_{-\infty}^{\infty} \left[\frac{e^{-(x - \mu_1)^2/2\sigma^2}}{e^{-(x - \mu_0)^2/2\sigma^2}}\right]^u \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/2\sigma^2}\, dx \tag{14}$$

The integral is easily evaluated, and on putting $\phi(u) = 1$ and solving
for u, we find that one root is u = 0 while the other is

$$h = \frac{\mu_1 + \mu_0 - 2\mu}{\mu_1 - \mu_0} \tag{15}$$

On substituting this expression for h in (13), we have an explicit
formula for $P(\mu)$ in terms of $\mu$.
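
As a numerical check on this illustration, the following Python sketch evaluates the approximate power function obtained by substituting (15) into (13), with A and B taken from (2.7) and (2.8). The function name `power_normal_mean` is our own, and the limiting value used at h = 0 is the one obtained by letting h approach zero in (13); both are assumptions of the sketch rather than statements from the text.

```python
import math


def power_normal_mean(mu: float, mu0: float, mu1: float,
                      alpha: float, beta: float) -> float:
    """Approximate P(mu) for the sequential test of mu0 against mu1.

    Uses h = (mu1 + mu0 - 2*mu)/(mu1 - mu0) from (15) and
    P = (1 - B**h)/(A**h - B**h) from (13). The known variance does not
    enter, since it cancels in the solution for h.
    Very large |h| (mu far from mu0 and mu1) may overflow; this sketch
    does not guard against that.
    """
    A = (1.0 - beta) / alpha
    B = beta / (1.0 - alpha)
    h = (mu1 + mu0 - 2.0 * mu) / (mu1 - mu0)
    if abs(h) < 1e-12:
        # Limit of (13) as h -> 0.
        return -math.log(B) / (math.log(A) - math.log(B))
    return (1.0 - B ** h) / (A ** h - B ** h)


if __name__ == "__main__":
    for mu in (-0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 1.5):
        print(mu, power_normal_mean(mu, 0.0, 1.0, 0.01, 0.01))
```

At $\mu = \mu_0$ the root is h = 1 and the formula returns $\alpha$; at $\mu = \mu_1$ the root is h = -1 and it returns $1 - \beta$, in agreement with (1) and (2).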

15.4. Average Sample Size. The sample size n in sequential testing
is a random variable with a density function, say p(n), which may be
determined in terms of the true density function $f(x; \theta)$. Thus

$$p(1) = P(\lambda_1 \le B) + P(\lambda_1 \ge A) \tag{1}$$

$$p(2) = P(B < \lambda_1 < A,\ \lambda_2 \le B) + P(B < \lambda_1 < A,\ \lambda_2 \ge A) \tag{2}$$

and so forth, where the probabilities on the right are determined
by integrals like that of equation (3.4). In this section we shall
find an approximate expression for the expected sample size E(n) and
then illustrate the extent to which sequential methods may save
observations.




Let

$$z = \log \frac{f(x; \theta_1)}{f(x; \theta_0)} \tag{3}$$

and let n be the smallest integer for which $z_1 + z_2 + \cdots + z_n = Z_n$
does not satisfy

$$\log B < Z_n < \log A \tag{4}$$


We shall show that the expected value of the variate $Z_n$, which
depends on the random z's and the random variate n, is simply

$$E(Z_n) = E(n)E(z) \tag{5}$$

To do this, we let N be some very large but fixed value of n and
disregard that part of the distribution of n to the right of N. The resulting
error can be made arbitrarily small by taking N sufficiently large.
Since N is fixed, it follows that

$$E(Z_N) = NE(z) \tag{6}$$

The variate $Z_N$ may be put in the form

$$Z_N = Z_n + W_n \tag{7}$$

defining another variate $W_n$, and by virtue of (6)

$$E(Z_n + W_n) = NE(z) \tag{8}$$

The trouble with trying to get (5) directly is that the range of $z_i$
depends on whether $i \le n$ or $i > n$. In the latter case $E(z_i) = E(z)$,
but when $i \le n$, the range of $z_i$ is restricted by (4). Now in (8) the
variate $W_n$ consists of z's with $i > n$, so that the expected value of
each z in $W_n$ is E(z). Thus

$$E(W_n) = E(z)E(N - n) \tag{9}$$

where the second factor on the right depends only on the distribution
of n. Combining (8) and (9),

$$NE(z) = E(Z_n) + E(W_n) \tag{10}$$

$$= E(Z_n) + E(z)[N - E(n)] \tag{11}$$

which is the same as (5); solving for E(n),

$$E(n) = \frac{E(Z_n)}{E(z)} \tag{12}$$
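
Relation (5) can be checked empirically. The sketch below is a simulation under assumptions of our own choosing: normal observations with unit variance, a test of mean 0 against mean 1 (so that $z = x - 1/2$), the approximate limits (2.7) and (2.8), and a fixed random seed. The function name `simulate_stopped_sums` is hypothetical. It returns the average of $Z_n$ and the average of n over many repetitions, which by (5) should differ by the factor E(z).

```python
import math
import random
from typing import Tuple


def simulate_stopped_sums(mu_true: float, runs: int, seed: int = 0,
                          alpha: float = 0.01, beta: float = 0.01
                          ) -> Tuple[float, float]:
    """Simulate the sequential test of mu = 0 against mu = 1 (unit variance).

    Returns (average of Z_n, average of n) over `runs` repetitions.
    For this pair of hypotheses z = x - 1/2, so E(z) = mu_true - 1/2.
    """
    rng = random.Random(seed)
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    total_z = 0.0
    total_n = 0
    for _ in range(runs):
        z_sum = 0.0
        n = 0
        # Continue sampling while inequality (4) holds.
        while log_b < z_sum < log_a:
            z_sum += rng.gauss(mu_true, 1.0) - 0.5
            n += 1
        total_z += z_sum
        total_n += n
    return total_z / runs, total_n / runs


if __name__ == "__main__":
    mean_z, mean_n = simulate_stopped_sums(mu_true=0.0, runs=20000)
    print("average Z_n      :", mean_z)
    print("average n * E(z) :", mean_n * (0.0 - 0.5))
```

The two printed quantities should agree up to sampling error. The simulated average of n will generally exceed the approximation derived next, because that approximation ignores the amount by which $Z_n$ overshoots the limits.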


This last expression enables one to get a simple approximate formula
for the expected sample size. The variate $Z_n$ takes on only values
beyond log A or smaller than log B. If one ignores the amounts
by which $Z_n$ exceeds log A or falls short of log B, he may say that $Z_n$
takes essentially only two values, log A and log B. When the true
distribution is $f(x; \theta)$, the probability that $Z_n$ takes the value log A is
$P(\theta)$, while the probability it takes the value log B is $1 - P(\theta)$. Hence

$$E(Z_n) \cong P(\theta) \log A + [1 - P(\theta)] \log B \tag{13}$$

which together with (12) gives

$$E(n) \cong \frac{P(\theta) \log A + [1 - P(\theta)] \log B}{E(z)} \tag{14}$$

This result enables one to compare sequential tests with fixed-sample-
size tests.

As an illustration, we shall consider the test that $\mu = 0$ against
$\mu = 1$ for a normal population with unit variance. We shall choose
$\alpha = .01$ and $\beta = .01$; then (2.7) and (2.8) give A = 99 and B = 1/99.
Let us further assume that the true parameter value is zero so that
$P(\theta)$ in (14) is just .01. Also we need to compute the expected value
of

$$z = \log \frac{e^{-(x-1)^2/2}}{e^{-x^2/2}} = x - \frac{1}{2} \tag{15}$$

which is $-1/2$ under the true distribution. Thus

$$E(n) \cong \frac{.01 \log 99 + .99 \log (1/99)}{-1/2} \cong 1.96 \log 99 = 9 \tag{16}$$


To get the same control of the two errors with a sample of fixed size,
we recall that the best test is made by choosing a number c and
accepting or rejecting $\mu = 0$ according as $\bar{x}$ is less than or greater than
c. The probability $\alpha$ that $H_0$ will be rejected (under $\mu = 0$) is

$$\alpha = \int_c^{\infty} \sqrt{\frac{n}{2\pi}}\, e^{-(n/2)\bar{x}^2}\, d\bar{x} = \int_{\sqrt{n}\,c}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

so that for $\alpha = .01$,

$$\sqrt{n}\, c = 2.326 \tag{17}$$

The probability $\beta$ that $H_0$ would be accepted under $H_1$ ($\mu = 1$) is

$$\beta = \int_{-\infty}^{c} \sqrt{\frac{n}{2\pi}}\, e^{-(n/2)(\bar{x}-1)^2}\, d\bar{x} = \int_{-\infty}^{\sqrt{n}(c-1)} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

so that for $\beta = .01$,

$$\sqrt{n}\,(c - 1) = -2.326 \tag{18}$$

On solving (17) and (18) for n, we find it to be 22. Thus in repeated
tests of the hypothesis in question, the sequential procedure would
require on the average only 9/22 or 41 per cent as many observations
as the fixed-sample-size procedure.
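
The arithmetic of this comparison can be reproduced with a short Python sketch. The function names are ours; `statistics.NormalDist` from the standard library supplies the normal percentage points that the text reads from a table. The hypotheses are those of the illustration ($\mu = 0$ against $\mu = 1$, unit variance), and natural logarithms are used throughout.

```python
import math
from statistics import NormalDist


def approx_expected_n_under_h0(alpha: float, beta: float) -> float:
    """Approximation (14) for mu = 0 against mu = 1 when mu = 0 is true.

    Under H0 the power P(theta) equals alpha, and E(z) = -1/2 by (15).
    """
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    expected_z = -0.5
    return (alpha * log_a + (1.0 - alpha) * log_b) / expected_z


def fixed_sample_size(alpha: float, beta: float) -> int:
    """Smallest n satisfying conditions of the form (17) and (18).

    Subtracting (18) from (17) gives sqrt(n) = z_alpha + z_beta, where
    z_alpha and z_beta are upper percentage points of the unit normal.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    z_beta = NormalDist().inv_cdf(1.0 - beta)
    return math.ceil((z_alpha + z_beta) ** 2)


if __name__ == "__main__":
    seq_n = approx_expected_n_under_h0(0.01, 0.01)
    fix_n = fixed_sample_size(0.01, 0.01)
    print(seq_n, fix_n, seq_n / fix_n)
```

With $\alpha = \beta = .01$ this gives about 9 observations on the average for the sequential test and 22 for the fixed-size test, the figures quoted above.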

15.5. Sampling Inspection. A particularly important application 
of sequential testing is in inspection of manufactured items. Large 
consumers such as retail chains, assembly plants, government agencies, 
and the like usually contract for periodic deliveries of items in large
groups called lots. Certain specifications for the items in question 
are stipulated in the contract, and it is further stipulated that the 
items shall be inspected or partially inspected to ensure that only a 
small proportion of the delivered items fail to meet the specifications. 
Ordinarily, defective items are not so crucial as to warrant the expense 
of complete inspection of all items, and sampling inspection is used. 
That is, the supplier will inspect a sample of the items of a lot and 
estimate the proportion of the lot defective. If the quality of the lot 
appears satisfactory, it is delivered; otherwise it may be sold to a less 
exacting consumer, or to the original consumer at a lower price, or it 
may be completely inspected (if the inspection is not destructive) 
and the defective items removed. When sampling inspection is to 
be used, the actual sampling procedure is often a part of the contract. 
The supplier does not guarantee that the proportion of defective items 
in submitted lots will be smaller than a given amount; he merely 
guarantees to submit only lots which have passed a specified sampling 
inspection test. 

The simplest sort of sampling inspection plan is the so-called single-
sampling plan. One inspects a sample of size n and accepts the lot as
satisfactory if the number of defective items is less than or equal to a
given number c; otherwise the lot is rejected. The probability of
accepting a lot under such a plan depends, of course, on the proportion
of defectives in the lot. The density function for the number of
defectives x is

$$f(x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}} \tag{1}$$

where N is the lot size and M is the number of defectives in the lot.
This distribution is somewhat troublesome to work with, and since n
is usually quite small relative to N, it is customary to approximate the
function by the binomial

$$f(x) = \binom{n}{x} p^x (1 - p)^{n-x} \tag{2}$$

where p = M/N is the proportion of defectives in the lot.

The performance of a sampling inspection plan may be portrayed
by the operating-characteristic curve, which is simply a graph of the
probability of accepting the lot plotted over the range of p. This
probability for the single-sampling plan is

$$L(p) = \sum_{x=0}^{c} f(x) = \sum_{x=0}^{c} \binom{n}{x} p^x (1 - p)^{n-x} \tag{3}$$

using the binomial approximation as we shall do in this and the next
section. An operating characteristic is plotted in Fig. 71.

[Fig. 71. An operating characteristic $L(p)$ plotted against p, with the ideal operating characteristic shown as a dashed curve.]

If, for
example, one wished to pass all lots with 6 per cent or less defective
and reject all lots with more than 6 per cent defective, the ideal
operating characteristic would be the dashed curve of Fig. 71. This could
not be achieved without complete inspection. Sampling inspection
will necessarily reject some of the acceptable lots and will accept some
lots which should be rejected. The more sampling one is willing to do,
the more nearly he can force the operating characteristic to
approximate the ideal operating characteristic. The actual extent of the
sampling in any instance depends, of course, on various economic
factors associated with the particular problem at hand: factors such
as production cost per item, inspection cost per item, difference in
market value of accepted and rejected lots, etc.
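
To make the operating characteristic concrete, the following Python sketch computes the probability of acceptance for a single-sampling plan, both from the hypergeometric function (1) and from the binomial approximation (3). The function names and the plan used in the example (n = 50, c = 3, lot size 1000) are illustrative assumptions, not values from the text.

```python
from math import comb


def accept_prob_binomial(n: int, c: int, p: float) -> float:
    """L(p) of (3): probability of c or fewer defectives in a sample of n."""
    return sum(comb(n, x) * p ** x * (1.0 - p) ** (n - x)
               for x in range(min(c, n) + 1))


def accept_prob_hypergeometric(N: int, M: int, n: int, c: int) -> float:
    """Exact acceptance probability from (1) for a lot of N items, M defective."""
    favorable = sum(comb(M, x) * comb(N - M, n - x)
                    for x in range(min(c, n) + 1))
    return favorable / comb(N, n)


if __name__ == "__main__":
    # Hypothetical plan: inspect 50 items, accept on 3 or fewer defectives.
    for M in (20, 40, 60, 80, 100):
        p = M / 1000
        print(p,
              accept_prob_binomial(50, 3, p),
              accept_prob_hypergeometric(1000, M, 50, 3))
```

Tabulating the two columns side by side shows how close the binomial approximation is when n is small relative to N, which is the justification given above for using (2).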


Sampling inspection plans may be regarded as procedures for testing
hypotheses. Thus the single-sampling plan is just the procedure one
would use to test the null hypothesis that the parameter p of a binomial
distribution has the value $p_0$ against alternatives $p > p_0$. Given a
sample size n and a specified size $\alpha$ for the Type I error, the Type II
error would be minimized for any $p > p_0$ by choosing the integer c for
which

$$\sum_{x=0}^{c} \binom{n}{x} p_0^x (1 - p_0)^{n-x} \cong 1 - \alpha \tag{4}$$

and rejecting the null hypothesis when x > c. It is to be observed
that the operating-characteristic function is simply one minus the
power function of the test.

Somewhat more sophisticated inspection plans use double sampling.
A small sample of size $n_1$ is examined, and the lot may be accepted or
rejected on the basis of this sample. But in borderline cases a second
sample of size $n_2$ is examined before the lot is finally classified one way
or the other. Formally the procedure is:

1. Examine a sample of size $n_1$.

2. If $x_1$ (number of defectives in $n_1$) $\le c_1$, accept the lot.

3. If $x_1 > c_2$, reject the lot.

4. If $c_1 < x_1 \le c_2$, examine a second sample of size $n_2$.

5. If $x_1 + x_2 \le c_3$, accept the lot.

6. If $x_1 + x_2 > c_3$, reject the lot.

This procedure contains the germ of the sequential idea. It is better 
than single sampling in the following sense: Given a single-sampling 
plan with sample size n and a double-sampling plan with average sam- 
ple size n, one can more nearly approximate the ideal operating char- 
acteristic with the latter. Or in other words, for a given operating 
characteristic double sampling will require on the average fewer 
observations than single sampling. 
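
The operating characteristic and the average sample size of a double-sampling plan follow directly from the six steps above. The Python sketch below uses the binomial approximation; the function names are ours, and the acceptance numbers are written `c1`, `c2`, `c3` on the assumption that the first-sample limits are $c_1$ and $c_2$ and the combined limit is $c_3$.

```python
from math import comb
from typing import Tuple


def binom_pmf(n: int, x: int, p: float) -> float:
    """Binomial probability of exactly x defectives in n items."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def binom_cdf(n: int, c: int, p: float) -> float:
    """Binomial probability of c or fewer defectives in n items."""
    if c < 0:
        return 0.0
    return sum(binom_pmf(n, x, p) for x in range(min(c, n) + 1))


def double_sampling(n1: int, n2: int, c1: int, c2: int, c3: int,
                    p: float) -> Tuple[float, float]:
    """Return (probability of acceptance, average sample size) at quality p."""
    accept = binom_cdf(n1, c1, p)      # step 2: accepted on the first sample
    prob_second = 0.0                  # probability that step 4 is reached
    for x1 in range(c1 + 1, min(c2, n1) + 1):
        p_x1 = binom_pmf(n1, x1, p)
        prob_second += p_x1
        # step 5: accepted if the second sample keeps the total at or below c3
        accept += p_x1 * binom_cdf(n2, c3 - x1, p)
    return accept, n1 + n2 * prob_second
```

For a hypothetical plan one might call `double_sampling(50, 100, 1, 4, 5, 0.03)` and compare the result with a single-sampling plan having about the same average sample size.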

15.6. Sequential Sampling Inspection. We shall suppose that large 
lots are being dealt with, so that the error of using the binomial distri- 
bution is of no practical importance. Let us further suppose that the 
supplier's production process, when all is well, produces about 2 per 
cent defectives and that the sampling inspection plan is supposed to 
accept most lots with less than 3 per cent defective and reject most 
lots with more than 3 per cent defective. This is the usual situation; 
a supplier who contracted to provide better quality than his production 
process was capable of would have little use for sampling inspection. 



In setting up a sequential plan, one must first put the test in terms of
a null hypothesis and a single alternative. Thus in the present
instance one might test the null hypothesis $p_0 = .025$ against the
alternative $p_1 = .04$, accepting the lot whenever the null hypothesis is
accepted. In general, two values $p_0$ and $p_1$ are chosen and two
probabilities $\alpha$ and $\beta$ for the Type I and Type II errors. Thus one has at
his disposal two points on the operating characteristic: $(p_0, 1 - \alpha)$
and $(p_1, \beta)$. One could make the inspection plan very critical at
p = .03 by choosing, for example, the two points (.029, .999) and
(.031, .001), but in doing so he would ensure that considerable sampling
would be done. The actual choice of these two points depends on
economic considerations.

The individual observations $y_i$ have the density function

$$p^y (1 - p)^{1-y} \qquad y = 0, 1 \tag{1}$$

and if $\sum_{i=1}^{m} y_i$ is denoted by x, the likelihood ratio is

$$\lambda_m = \frac{p_1^x (1 - p_1)^{m-x}}{p_0^x (1 - p_0)^{m-x}} \tag{2}$$

Observations are taken until either $\lambda_m \le B$, in which case the lot is
accepted, or $\lambda_m \ge A$, in which case the lot is rejected. A and B are
computed from (2.7) and (2.8).


To get the operating characteristic, one first finds $\phi(u)$, which is
simply

$$\phi(u) = E\left\{\left[\frac{p_1^y (1 - p_1)^{1-y}}{p_0^y (1 - p_0)^{1-y}}\right]^u\right\} \tag{3}$$

$$= \sum_{y=0}^{1} p^y (1 - p)^{1-y} \left[\left(\frac{p_1}{p_0}\right)^y \left(\frac{1 - p_1}{1 - p_0}\right)^{1-y}\right]^u = p\left(\frac{p_1}{p_0}\right)^u + (1 - p)\left(\frac{1 - p_1}{1 - p_0}\right)^u \tag{4}$$

The number h of Sec. 3 is the nonzero root of $\phi(u) = 1$, so that h is
defined by

$$p\left(\frac{p_1}{p_0}\right)^h + (1 - p)\left(\frac{1 - p_1}{1 - p_0}\right)^h = 1 \tag{5}$$

This equation together with

$$L(p) \cong \frac{A^h - 1}{A^h - B^h} \tag{6}$$

[obtained by subtracting both sides of (3.13) from one] determine the
operating-characteristic function. Since the solution of (5) for h is
a troublesome computation, one computes points on the curve by
choosing values for h arbitrarily and calculating the corresponding
values of p and L(p) from (5) and (6).

Often a sufficient appraisal of the operating characteristic can be
obtained from five easily computed points on the curve:

$$L(0) = 1 \tag{7}$$

$$L(1) = 0 \tag{8}$$

$$L(p_0) = 1 - \alpha \tag{9}$$

$$L(p_1) = \beta \tag{10}$$

$$L(p') = \frac{\log A}{\log A - \log B} \tag{11}$$

where

$$p' = \frac{\log [(1 - p_0)/(1 - p_1)]}{\log (p_1/p_0) - \log [(1 - p_1)/(1 - p_0)]} \tag{12}$$

The fifth point $[p', L(p')]$ is between $p_0$ and $p_1$ and corresponds to
h = 0; the formulas (11) and (12) are obtained by letting h approach
zero in (5) and (6), which become indeterminate at h = 0.
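
The parametric computation described above is easily mechanized. In the Python sketch below, (5) is solved for p as a function of h, and (6) then gives L(p); at h = 0 the limiting values (11) and (12) are used. The function name `oc_point` and the error probabilities in the example ($\alpha = .05$, $\beta = .10$) are illustrative assumptions; only $p_0 = .025$ and $p_1 = .04$ come from the text.

```python
import math
from typing import Tuple


def oc_point(h: float, p0: float, p1: float,
             alpha: float, beta: float) -> Tuple[float, float]:
    """Return the point (p, L(p)) on the operating characteristic for a given h."""
    A = (1.0 - beta) / alpha
    B = beta / (1.0 - alpha)
    r1 = p1 / p0                      # ratio appearing with p in (5)
    r2 = (1.0 - p1) / (1.0 - p0)      # ratio appearing with 1 - p in (5)
    if abs(h) < 1e-9:
        # Limiting forms (12) and (11) as h -> 0.
        p = -math.log(r2) / (math.log(r1) - math.log(r2))
        L = math.log(A) / (math.log(A) - math.log(B))
    else:
        # Equation (5) is linear in p for fixed h.
        p = (1.0 - r2 ** h) / (r1 ** h - r2 ** h)
        L = (A ** h - 1.0) / (A ** h - B ** h)
    return p, L


if __name__ == "__main__":
    for h in (3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0, -3.0):
        print(h, oc_point(h, 0.025, 0.04, 0.05, 0.10))
```

Taking h = 1 and h = -1 reproduces the points (9) and (10), which provides a convenient check on the computation.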


[Fig. 72. Average-sample-size curve $E(n)$ plotted against p, with $p_0$, $p'$, and $p_1$ marked on the p axis.]


The average-sample-size curve may be plotted easily after L(p) has
been plotted. Referring to equation (4.14), the ordinate of this curve
(Fig. 72) is given by

$$E(n) \cong \frac{[1 - L(p)] \log A + L(p) \log B}{p \log (p_1/p_0) + (1 - p) \log [(1 - p_1)/(1 - p_0)]} \tag{13}$$

where we have substituted 1 - L(p) for P(p) and

$$E(z) = E\left[\log \frac{p_1^y (1 - p_1)^{1-y}}{p_0^y (1 - p_0)^{1-y}}\right] \tag{14}$$

$$= p \log \frac{p_1}{p_0} + (1 - p) \log \frac{1 - p_1}{1 - p_0} \tag{15}$$


The maximum value of E(n) occurs very nearly at the point $p'$ given
by (12). At that point, (13) becomes an indeterminate form whose
limiting value is

$$\frac{\log A \, \log B}{\log (p_1/p_0) \, \log [(1 - p_1)/(1 - p_0)]} \tag{16}$$

This is approximately the maximum average sample size and occurs
when the true proportion defective has the value given by (12).
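
The two formulas for the average sample size may be evaluated as follows. This Python sketch is illustrative only: the function names are ours, and the error probabilities $\alpha = .05$ and $\beta = .10$ in the example are assumed values, the text having fixed only $p_0 = .025$ and $p_1 = .04$. The value of L(p) must be supplied; at $p_0$ and $p_1$ it is known from (9) and (10).

```python
import math


def average_sample_size(p: float, L: float, p0: float, p1: float,
                        alpha: float, beta: float) -> float:
    """Approximate E(n) at quality p, given the ordinate L = L(p).

    Not valid at the point p' of (12), where the denominator E(z) vanishes;
    use max_average_sample_size there.
    """
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    expected_z = (p * math.log(p1 / p0)
                  + (1.0 - p) * math.log((1.0 - p1) / (1.0 - p0)))
    return ((1.0 - L) * log_a + L * log_b) / expected_z


def max_average_sample_size(p0: float, p1: float,
                            alpha: float, beta: float) -> float:
    """Limiting value of E(n) at p', approximately the maximum of the curve."""
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    return (log_a * log_b) / (math.log(p1 / p0)
                              * math.log((1.0 - p1) / (1.0 - p0)))


if __name__ == "__main__":
    p0, p1, alpha, beta = 0.025, 0.04, 0.05, 0.10
    print("E(n) at p0 :", average_sample_size(p0, 1.0 - alpha, p0, p1, alpha, beta))
    print("E(n) at p1 :", average_sample_size(p1, beta, p0, p1, alpha, beta))
    print("maximum    :", max_average_sample_size(p0, p1, alpha, beta))
```

Both log B and the second logarithm in the denominator of the limiting value are negative, so the quotient is positive, as a sample size must be.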

15.7. Sequential Test for the Mean of a Normal Population. As a
final example of sequential testing, we shall consider the two-sided
test of the null hypothesis $H_0$ that the mean of a normal population
has the value $\mu_0$. It is assumed that the variance $\sigma^2$ is known. It is
necessary to frame the test in terms of a single alternative $H_1$. If
we were interested in a one-sided test, say against alternatives $\mu > \mu_0$,
we should simply choose some arbitrary value $\mu_1$ (greater than $\mu_0$)
for the alternative. But that alternative will not serve for the two-
sided test, because the power function approaches zero as $\mu$ moves to
the left.

The trick here is to phrase the hypotheses in terms of another
parameter $\delta$ which measures the distance of $\mu$ from $\mu_0$. The new parameter
$\delta$ takes only positive values and is defined by

$$\delta = \mu - \mu_0 \qquad \text{if } \mu \ge \mu_0 \tag{1}$$

$$\delta = \mu_0 - \mu \qquad \text{if } \mu < \mu_0 \tag{2}$$
The null hypothesis is now $\delta = 0$, and the alternative is $\delta = \delta_1$, where
$\delta_1$ is an arbitrarily chosen number. Now one must set up a somewhat
artificial alternative distribution function, because the number $\delta_1$
actually refers to two distributions, one with mean $\mu_0 - \delta_1$ and one
with mean $\mu_0 + \delta_1$. The alternative density function is defined to be

$$f_1(x) = \frac{1}{2\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu_0 + \delta_1)^2/2\sigma^2} + \frac{1}{2\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu_0 - \delta_1)^2/2\sigma^2} \tag{3}$$

which is clearly a density function. Under $H_0$ the density function is,
of course,

$$f_0(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu_0)^2/2\sigma^2} \tag{4}$$

It is apparent that the likelihood ratio will behave as we wish. If $\mu$
is to the left of $\mu_0 - \delta_1$, the ratio $f_1/f_0$ will usually be large because of
the first term on the right of (3), while if $\mu > \mu_0 + \delta_1$, it will be large
because of the second term.



The test is now performed in accordance with the usual procedure.
One chooses probabilities $\alpha$ and $\beta$ for the two types of error and
computes A and B from (2.7) and (2.8). For a very sensitive test one
would choose $\delta_1$ as well as $\alpha$ and $\beta$ to be small.

[Fig. 73. Power function $P(\mu)$ of the two-sided test plotted against $\mu$.]

Observations are made until

$$\lambda_m = \prod_{i=1}^{m} \frac{f_1(x_i)}{f_0(x_i)} \tag{5}$$

exceeds A or becomes less than B.
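
For computation it helps to note that, on dividing (3) by (4), each factor of the likelihood ratio reduces to $e^{-\delta_1^2/2\sigma^2} \cosh[\delta_1 (x - \mu_0)/\sigma^2]$. The Python sketch below uses this form on the log scale. The function names are ours, and the stopping rule is the usual one with the approximate limits (2.7) and (2.8); it is offered as an illustration rather than as the only way to carry out the test.

```python
import math
from typing import Iterable, Tuple


def log_ratio_two_sided(x: float, mu0: float, delta1: float, sigma: float) -> float:
    """log[f1(x)/f0(x)] for the densities (3) and (4)."""
    t = abs(delta1 * (x - mu0) / sigma ** 2)
    # log cosh(t), written so that large t does not overflow
    log_cosh = t + math.log1p(math.exp(-2.0 * t)) - math.log(2.0)
    return -delta1 ** 2 / (2.0 * sigma ** 2) + log_cosh


def two_sided_sequential_test(observations: Iterable[float], mu0: float,
                              delta1: float, sigma: float,
                              alpha: float, beta: float) -> Tuple[str, int]:
    """Sequential two-sided test of the mean; returns (decision, sample size)."""
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    total = 0.0
    m = 0
    for x in observations:
        m += 1
        total += log_ratio_two_sided(x, mu0, delta1, sigma)
        if total >= log_a:
            return "reject H0", m
        if total <= log_b:
            return "accept H0", m
    return "undecided", m
```

Because cosh is an even function, an observation contributes the same amount whether it lies above or below $\mu_0$ by a given distance, which is exactly the two-sided behavior sought.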

The test given here for $H_0$ is merely one of many possibilities. We
have been quite arbitrary in setting up the alternative density func- 
tion, and it is entirely conceivable that some other form might improve 
the test, might reduce the average sample size under the null hypothe- 
sis, for example, or might have other desirable properties. 

When the variance is unknown, several tests are available; most of 
them use weight functions of one kind or another. Perhaps the 
simplest test is that based on the t distribution. If we denote by
gₙ(t; μ) the density function for t with n degrees of freedom and with μ
the mean of the normal population, then one may define

$$\lambda_n = \frac{g_n(t;\, \mu_1)}{g_n(t;\, \mu_0)} \tag{6}$$

with n = 2, 3, 4, ···. Although this function is not of the same
type as the others we have considered (because the numerator and
denominator are not products of density functions of independent
variates), it can be shown that the test terminates and that (2.7) and
(2.8) determine A and B as before.


The criterion (6) refers, of course, to the one-sided test of μ₀ against
an alternative μ₁ greater than μ₀. For a two-sided test of μ₀, one may
use

$$\lambda_n = \frac{\tfrac12\left[g_n(t;\, \mu_0 + \delta_1) + g_n(t;\, \mu_0 - \delta_1)\right]}{g_n(t;\, \mu_0)} \tag{7}$$

where δ has the same meaning as in (1) and (2).

15.8. Notes and References. Sequential analysis is a quite recent 
development in the theory of statistics, having been started in 1943.
The theory is due primarily to Wald [1], whose excellent and quite 
readable book on the subject contains most of the developments made 
up to the date of its publication. Wald’s work has stimulated much 
research, and the techniques of sequential analysis will doubtless be 
extended considerably during the next few years. 

Thus far, most attention has been given to the matter of testing 
hypotheses, but sequential methods also promise to increase the 
efficiency of estimation procedures. The problem here is to choose in 
advance a 1 — a confidence interval of specified length and make 
observations until the confidence interval can be said to cover the true 
parameter value with the desired probability. 

The matter of testing composite hypotheses requires further develop- 
ment. Wald has shown that this problem may be dealt with by means 
of certain weight functions chosen in an optimum fashion. But a 
detailed general theory is not yet available. 

A good exposition of sampling inspection from the practical point
of view is given by the second reference.

1. A. Wald: “Sequential Analysis,” John Wiley & Sons, Inc., New
York, 1947.
2. H. A. Freeman, M. Friedman, F. Mosteller, and W. A. Wallis:
“Sampling Inspection,” McGraw-Hill Book Company, Inc., New
York, 1948.


15.9. Problems 


1. Perform a sequential test of the null hypothesis that p = .45
against the alternative that p = .30. Let p refer to the probability
of a head in tossing a coin, and carry through the test by tossing a
coin using α = .10 and β = .10. The arithmetic is simplified by solving
log λₙ = log B and log λₙ = log A for xₙ (the number of heads in n tosses),
thus obtaining acceptance and rejection numbers as linear functions
of n.




2. Show that equation (3.13) is correct when h is negative. 

3. Assuming a lot has size N with M defectives, what is the exact 
expression for the operating-characteristic function? 

4. Show that the ratio obtained at the end of Sec. 4 depends
only on the values of α and β and not on the sizes of σ², μ₀, and μ₁.

5. Compare the average sequential sample size with the fixed
sample size for the one-sided test of the mean of a normal population
when α = .01, β = .05, and the alternative hypothesis is true.

6. Show that the one-sided test for the mean of a normal population
with known variance may be performed by plotting the two lines

$$y = \frac{\sigma^2}{\mu_1 - \mu_0}\,\log B + \frac{\mu_0 + \mu_1}{2}\, n$$

$$y = \frac{\sigma^2}{\mu_1 - \mu_0}\,\log A + \frac{\mu_0 + \mu_1}{2}\, n$$

in the n, y plane; then plotting $\sum_{i=1}^{n} x_i$ against n as the observations are
made. The test ends when one of the lines is crossed.
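7. Referring to Prob. 6, let c = (μ₀ + μ₁)/2 and let the two constants
in the equations be denoted by b and a; i.e.,

$$b = \frac{\sigma^2 \log B}{\mu_1 - \mu_0} \qquad a = \frac{\sigma^2 \log A}{\mu_1 - \mu_0}$$

Show that the power function for the test may be put in the form

$$P(\mu) = \frac{1 - e^{2(c-\mu)b/\sigma^2}}{e^{2(c-\mu)a/\sigma^2} - e^{2(c-\mu)b/\sigma^2}}$$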


8. Referring to Probs. 6 and 7, show that the expression for the
average sample size may be written

$$E(n) = \frac{b + P(\mu)(a - b)}{\mu - c}$$

9. Verify equations (6.11) and (6.12).

10. Plot the power function and average-sample-size function for
the test of Prob. 1.

11. Plot the power function and the average-sample-size function
for the test that the mean of a normal population is zero against the
alternative that it is one. Let σ² = 1, α = .01, β = .05.



12. Find formulas for the power function and average sample size 
for sequential tests on the mean of a Poisson distribution. 

13. Suppose a production process produces lots of size N with M 
defectives in such a way that M has a binomial distribution. Show 
that a sample of size n (with x defectives) can provide no information 
about the proportion of defectives in the remaining N — n items of a 
lot. 

14. Suppose lots which are rejected under a sequential sampling 
inspection procedure are completely inspected and the defective items 
replaced by good items; this is a common practice. Let p be the 
proportion of defectives in the original lots. What will be the average 
proportion of defectives over all delivered lots counting both those 
completely inspected and those passed by the sampling plan? This 
function of p is called the average outgoing quality function; the maxi- 
mum of the function is called the average outgoing quality limit. Make 
a rough sketch showing the general shape of the function. 

15. Referring to the situation described in Prob. 14, find the average 
percentage of items inspected as a function of p, counting both passed
and completely inspected lots. Make a rough sketch showing the 
general shape of the function. 

16. Suppose a uniform distribution has the range 0 < x < θ. Discuss
the sequential test of θ = θ₀ against θ = θ₁ with θ₀ < θ₁. Be
careful here; some of the general formulas may not be applicable.


17. By an argument similar to that used to obtain (4.5), Wald has
shown that

$$E\left[e^{tZ_n}\,\{\phi(t)\}^{-n}\right] = 1$$

where φ(t) is the moment generating function of z, i.e., φ(t) = E(e^{tz}),
and where the expectation E is over the joint distribution of the z's
and the random variable n. This is called the fundamental identity
of sequential analysis. Use it to obtain (4.5).

18. Use the identity of Prob. 17 to show that

$$E(n) = \frac{E(Z_n^2)}{E(z^2)}$$

when E(z) = 0.

19. Use the result of Prob. 18 to obtain (6.16). 

20. Use the result of Prob. 18 to show that the maximum average 
sample size for one-sided tests of the mean of a normal population is
approximately −ab/σ², where a and b are defined in Prob. 7. Assume,
do not try to prove, that the maximum occurs at h = 0. 



CHAPTER 16 
DISTRIBUTION-FREE METHODS 


16.1. Introduction. In Sec. 7.8 the important place ascribed to the 
normal distribution in statistical theory was justified on the basis 
that any known continuous distribution could be transformed to the 
normal distribution. But, of course, experimenters quite frequently 
have no knowledge of the form of the distribution with which they are 
dealing, or at least so little information that they cannot prescribe 
a normalizing transformation. Until recently there was not much 
to be done in this situation, and experimenters were more or less forced 
to make wholesale assumptions of normality. During the past few 
years, however, techniques have been developed for estimating
parameters and testing hypotheses which require no assumption about the
form of the distribution function. These techniques are called
non-parametric methods, or better, distribution-free methods. While the
collection of distribution-free methods is not nearly so comprehensive
as that based on normal theory, a good beginning has been made, and
this chapter will present some of the results. 

Heretofore in denoting a sample by x₁, x₂, ···, xₙ, the symbol x₁
referred to the first observation made, x₂ to the second, and so on.
Throughout this chapter the notation will be interpreted quite
differently. The symbol x₁ will refer to the smallest of the n observations,
x₂ will represent the second smallest of the observations, and so on,
with xₙ the largest. Thus, for the sample of four observations, 2, −4,
−1, 1, x₁ refers to the second observation, x₂ to the third, x₃ to the
fourth, and x₄ to the first. The phrase ordered sample is often used to
indicate this interpretation of the notation. Distribution-free
methods are based entirely on these ordered observations, or order statistics.

The methods to be presented are applicable to both continuous and 
discrete variates, but we shall direct our attention almost entirely 
to the continuous case, merely pointing out occasionally the modifica- 
tions that would be required in the case of discrete variates. 

16.2. A Basic Distribution. The whole structure of distribution-free
methods rests on a simple property of order statistics: the
distribution of the area under the density function between any two ordered
observations is independent of the form of the density function. To
show this, we merely make the probability transformation described
in Sec. 6.1. The density function for the ordered sample x₁, x₂, ···,
xₙ is

$$n!\, f(x_1) f(x_2) \cdots f(x_n) \tag{1}$$

if f(x) is the population density function. The factor n! arises from
the fact that there are n! permutations of the observations and every
permutation gives rise to the same ordered sample. The density for
any given permutation is just Πf(xᵢ), so the density for the ordered
sample is obtained by summing this expression over all permutations
of the xᵢ. The variates in (1) are restricted by the inequalities

$$-\infty < x_1 < x_2 < \cdots < x_n < \infty \tag{2}$$

If we let

$$u_i = \int_{-\infty}^{x_i} f(x)\, dx = F(x_i) \tag{3}$$

then in accordance with Sec. 6.1, the density function for the uᵢ is
simply

$$g(u_1, u_2, \cdots, u_n) = n! \qquad 0 < u_1 < u_2 < \cdots < u_n < 1 \tag{4}$$

which does not depend on f(x).

The density function g(u₁, ···, uₙ) enables one to find the
distribution of any set of areas under f(x) between pairs of ordered
observations. For example, suppose we desire the density function for the
area under f(x) between x₁ and xₙ. This area, say v, is

$$v = F(x_n) - F(x_1) = u_n - u_1 \tag{5}$$

We first integrate out u₂, u₃, ···, uₙ₋₁ in (4), then make the
substitution uₙ = u₁ + v and integrate out u₁. Thus

$$h(u_1, u_n) = \int_{u_1}^{u_n} \cdots \int_{u_1}^{u_4} \int_{u_1}^{u_3} n!\; du_2\, du_3 \cdots du_{n-1} \tag{6}$$

$$= \frac{n!}{(n-2)!}\,(u_n - u_1)^{n-2} \qquad 0 < u_1 < u_n < 1 \tag{7}$$

and the density of u₁ and v is

$$k(u_1, v) = n(n-1)\, v^{n-2} \qquad 0 < u_1 < (1 - v) < 1 \tag{8}$$

On integrating out u₁, we obtain the required density

$$m(v) = n(n-1)\, v^{n-2}(1 - v) \qquad 0 < v < 1 \tag{9}$$

which is a beta density function.
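
As a small numerical illustration of (9), the sketch below evaluates the density of the area v covered by the sample range and its cumulative form, which follows by integrating (9) term by term: P(V ≤ v) = n v^(n−1) − (n − 1) vⁿ. The function names are hypothetical and are not from the text; the sketch assumes n ≥ 2.

```python
def range_coverage_density(v, n):
    """Density (9) of the area under f(x) between the smallest and largest
    of n observations, for 0 <= v <= 1 and n >= 2."""
    return n * (n - 1) * v ** (n - 2) * (1.0 - v)


def range_coverage_cdf(v, n):
    """Integral of (9) from 0 to v: n v^(n-1) - (n-1) v^n."""
    return n * v ** (n - 1) - (n - 1) * v ** n


def prob_range_covers_at_least(fraction, n):
    """Probability that the sample range covers at least the given
    fraction of the population, whatever the form of f(x)."""
    return 1.0 - range_coverage_cdf(fraction, n)
```

For instance, `prob_range_covers_at_least(0.5, 10)` gives the chance that ten observations span at least half of the population, and the answer does not depend on the population's form.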
More generally, the area w between xᵣ and xₛ (s > r) under f(x)
can readily be shown to have the density function

$$\frac{n!}{(s-r-1)!\,(n-s+r)!}\; w^{s-r-1}(1-w)^{n-s+r} \qquad 0 < w < 1 \tag{10}$$

Also one can obtain a joint density function for several such areas.
The expected value of uᵢ is

$$E(u_i) = \int_0^1 \int_0^{u_n} \cdots \int_0^{u_2} n!\, u_i\; du_1\, du_2 \cdots du_n = \frac{i}{n+1} \tag{11}$$

hence the expected area under f(x) between two successive
observations is

$$E(u_i) - E(u_{i-1}) = \frac{1}{n+1} \tag{12}$$

Thus, on the average, the n ordered observations divide the area under
f(x) into n + 1 equal parts of area 1/(n + 1) each.
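
The result (11) can be checked empirically. The following is a simulation sketch, not part of the original text: it draws repeated samples of size n = 4 from two quite different populations (exponential and uniform, both chosen here only for illustration), transforms the ordered observations by the population cumulative distribution, and averages the resulting areas. The helper name `mean_areas`, the seed, and the number of trials are all assumptions of the example.

```python
import math
import random


def mean_areas(cdf, draw, n, trials, seed=0):
    """Average of F(x_i) over many ordered samples of size n.

    cdf  -- the population cumulative distribution F
    draw -- function taking a random.Random and returning one observation
    """
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        ordered = sorted(draw(rng) for _ in range(n))
        for i, x in enumerate(ordered):
            totals[i] += cdf(x)
    return [t / trials for t in totals]


# Two populations of very different shape.
exponential = mean_areas(lambda x: 1.0 - math.exp(-x),
                         lambda rng: rng.expovariate(1.0), n=4, trials=20000)
uniform = mean_areas(lambda x: x,
                     lambda rng: rng.random(), n=4, trials=20000)
```

Both lists should lie close to 1/5, 2/5, 3/5, 4/5, in agreement with (11), which is the sense in which the areas between order statistics are distribution-free.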

16.3. Location and Dispersion. In the parametric case we have 
used the population mean and standard deviation as measures of 
location and dispersion. The distribution-free methods use other 
measures. The center of the population is defined to be the median, 
say ν, which is the point that divides the area under the density
function in half. Thus ν is defined by

$$\tfrac12 = \int_{-\infty}^{\nu} f(x)\, dx = F(\nu) \tag{1}$$

where f(x) is the density function and F(x) is the cumulative
distribution. The median is often denoted by $\xi_{.50}$ and a similar notation is
used for other percentage points; thus

$$F(\xi_{.15}) = .15 \tag{2}$$

defines the 15 per cent point, $\xi_{.15}$, of the population.

As a measure of dispersion one uses the distance between two
percentage points. Thus, one frequently used measure of dispersion is

$$T_{.50} = \xi_{.75} - \xi_{.25} \tag{3}$$

which is called the 50 per cent range, or the interquartile range. But
many other ranges are often used, for example, the 90 per cent range
$T_{.90} = \xi_{.95} - \xi_{.05}$, or the 33⅓ per cent range $T_{1/3} = \xi_{2/3} - \xi_{1/3}$.



Point Estimation. The population median ν is estimated by the
sample median ν̃, which is the middle observation if the sample size
is odd or the average of the two middle observations if the sample size
is even. Thus

$$\tilde\nu = x_{k+1} \qquad \text{if } n = 2k + 1 \tag{4}$$

$$\tilde\nu = \tfrac12\,(x_k + x_{k+1}) \qquad \text{if } n = 2k \tag{5}$$

The sample median ν̃ is not ordinarily an unbiased estimate of ν even
when n is odd, for the fact that E[F(ν̃)] = F(ν) does not imply that
E(ν̃) = ν. However the bias is not serious and must approach zero
as the sample size increases.

To estimate percentage points, the xᵢ themselves furnish estimates
of the 100i/(n + 1) per cent points. For other values one may use
linear interpolation. Thus to estimate $\xi_{.25}$ from a sample of size
n = 10, we observe that x₂ estimates the 2/11 point and x₃ the 3/11
point; hence we use as the estimate

$$\hat\xi_{.25} = x_2 + \frac{.25 - \tfrac{2}{11}}{\tfrac{1}{11}}\,(x_3 - x_2) \tag{6}$$

Given estimates of percentage points, one can obviously estimate the
various ranges.
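
The point estimates above translate directly into code. The sketch below is illustrative only; the function names are hypothetical, and the decision to raise an error when the requested point lies outside the range of the 100i/(n + 1) per cent points is a choice made here, since the text does not discuss extrapolation.

```python
def sample_median(data):
    """Sample median as in (4) and (5)."""
    xs = sorted(data)
    n = len(xs)
    k = n // 2
    if n % 2 == 1:
        return xs[k]                      # x_{k+1} in 1-based notation
    return 0.5 * (xs[k - 1] + xs[k])      # average of x_k and x_{k+1}


def percent_point(data, p):
    """Estimate the 100p per cent point by linear interpolation between
    the order statistics, x_i being taken as the 100i/(n+1) per cent point."""
    xs = sorted(data)
    n = len(xs)
    position = p * (n + 1)                # fractional 1-based rank
    if position < 1 or position > n:
        raise ValueError("p lies outside the range covered by the sample")
    lower = int(position)                 # rank of the order statistic below
    fraction = position - lower
    if fraction == 0:
        return xs[lower - 1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])
```

With ten observations and p = .25 the fractional rank is 2.75, so the estimate lies three-quarters of the way from x₂ to x₃, as in (6).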

Confidence Intervals. A confidence interval for ν is easily
constructed by means of the binomial distribution. The probability that
an observation falls to the left or right of ν is one-half in either case.
The probability that exactly i observations fall to the left of ν is just

$$\binom{n}{i} \left(\tfrac12\right)^n \tag{7}$$

The probability that xᵣ, the rth-order statistic, exceeds ν is then

$$P(x_r > \nu) = \sum_{i=0}^{r-1} \binom{n}{i} \left(\tfrac12\right)^n \tag{8}$$

and similarly

$$P(x_s < \nu) = \sum_{i=s}^{n} \binom{n}{i} \left(\tfrac12\right)^n \tag{9}$$

If we now suppose s > r, add (8) and (9), and subtract both sides
from unity, we have

$$P(x_r < \nu < x_s) = \sum_{i=r}^{s-1} \binom{n}{i} \left(\tfrac12\right)^n \tag{10}$$

which provides a confidence interval for ν. Ordinarily s is taken to be
n − r + 1 so that the rth observations in order of magnitude from the
top and from the bottom are used. Thus for a sample of size 6, two
possible confidence intervals for the median are

$$P(x_1 < \nu < x_6) = 1 - \left(\tfrac12\right)^6 - \left(\tfrac12\right)^6 = \tfrac{31}{32} \cong .97 \tag{11}$$

and

$$P(x_2 < \nu < x_5) = 1 - \tfrac{7}{64} - \tfrac{7}{64} = \tfrac{25}{32} \cong .78 \tag{12}$$

If one wished to do so, he could approximate a 95 or 90 per cent
confidence interval by using linear interpolation between (11) and (12),
but this is rarely done in practice. One ordinarily restricts himself
to the confidence levels available with the simple order statistics.

If the sample size is small, one has only a few confidence levels
available; in particular, when n = 2, there is only the 50 per cent
confidence interval given by

$$P(x_1 < \nu < x_2) = .50 \tag{13}$$

For moderate sample sizes the binomial sum in (10) may be computed
directly or found in tables of the incomplete beta function. For large
n one would use the normal approximation to the binomial. Since
the index i in (7) is approximately normal with mean n/2 and standard
deviation √n/2 for large n, a 95 per cent confidence interval, for
example, is obtained by counting 1.96√n/2 observations to the left
and right of the sample median.
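
The binomial sum in (10) is easy to evaluate directly. The sketch below is an illustration rather than part of the original text; the function names are hypothetical. It computes the confidence level attached to the pair of order statistics (xᵣ, xₛ) and, for large n, the number of observations to count off on each side of the sample median under the normal approximation.

```python
from math import comb, sqrt


def median_interval_level(n, r, s):
    """Confidence level (10) of the interval (x_r, x_s) for the median."""
    return sum(comb(n, i) for i in range(r, s)) / 2 ** n


def large_sample_half_width(n, z=1.96):
    """Number of observations to count to the left and right of the sample
    median for large n; z = 1.96 corresponds to the 95 per cent level."""
    return z * sqrt(n) / 2.0
```

For n = 6 the function reproduces the two levels in (11) and (12), and for n = 2 the single level in (13).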
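A similar technique is employed to obtain confidence intervals for
percentage points. If $\xi_p$ is the 100p per cent point of the distribution,
then the same argument used to obtain (10) shows that

$$P(x_r < \xi_p < x_s) = \sum_{i=r}^{s-1} \binom{n}{i}\, p^i (1-p)^{n-i} \tag{14}$$

Thus for a sample of size 6, a possible confidence interval for the
25 per cent point is given by

$$P(x_1 < \xi_{.25} < x_4) = \sum_{i=1}^{3} \binom{6}{i} \left(\tfrac14\right)^i \left(\tfrac34\right)^{6-i} \cong .78 \tag{15}$$

A 96 per cent upper bound for $\xi_{.25}$ is given by

$$P(\xi_{.25} < x_4) = \sum_{i=0}^{3} \binom{6}{i} \left(\tfrac14\right)^i \left(\tfrac34\right)^{6-i} \cong .96 \tag{16}$$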
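
The same computation extends to any percentage point. In the sketch below (an illustration with a hypothetical function name), passing r = 0 drops the lower limit, so that the sum runs from i = 0 and gives the level of a one-sided upper bound.

```python
from math import comb


def percent_point_interval_level(n, r, s, p):
    """Confidence level (14) of the interval (x_r, x_s) for the 100p per
    cent point; use r = 0 for the one-sided upper bound x_s."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(r, s))
```

With n = 6 and p = .25, the call with r = 1, s = 4 gives about .78 and the call with r = 0, s = 4 gives about .96.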




Tests of Hypotheses. To test the null hypothesis ν = ν₀ against
alternatives ν > ν₀, one uses the relation (8), choosing in advance an
integer r so that the probability of a Type I error will have as nearly
as possible the desired value. Thus for a sample of size 6 one can
make the probability of a Type I error 7/64 ≅ .11 by choosing r = 2.
If after drawing the sample one finds x₂ < ν₀, the null hypothesis is
accepted; if x₂ > ν₀, it is rejected. In the same fashion two-sided
tests of ν = ν₀ may be constructed; the two-sided test is obviously
equivalent to constructing a confidence interval for ν and accepting or
rejecting ν = ν₀ according as the confidence interval does or does not
cover ν₀. Tests on a percentage point $\xi_p$ would be carried out by the
same technique, using probabilities p and (1 − p) instead of ½ and ½.
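
A short sketch of the one-sided test follows. It is illustrative only; the function names are hypothetical, and choosing r as the value whose Type I error is closest to the target is one reading of "as nearly as possible".

```python
from math import comb


def type_one_error(n, r):
    """P(x_r > nu0) under the null hypothesis, from relation (8)."""
    return sum(comb(n, i) for i in range(r)) / 2 ** n


def choose_r(n, target):
    """Integer r whose Type I error is closest to the desired value."""
    return min(range(1, n + 1), key=lambda r: abs(type_one_error(n, r) - target))


def one_sided_median_test(sample, nu0, r):
    """Reject nu = nu0 in favour of nu > nu0 when the rth smallest
    observation exceeds nu0."""
    xs = sorted(sample)
    return "reject" if xs[r - 1] > nu0 else "accept"
```

For a sample of size 6 and a target near .10, `choose_r` returns 2, matching the example in the text.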

It is now apparent that the distribution-free methods, besides being 
extremely general in that they require no assumption about the form 
of the distribution function, are also extraordinarily simple. No 
complex analysis or distribution theory is needed; the simple binomial 
provides all the necessary equipment for estimation and testing 
hypotheses when one is dealing with a single population. The only 
inconvenience is in the paucity of significance levels or confidence levels 
when the sample size is quite small. 

A word about the discrete case is in order here. We have assumed 
the density function was continuous. If it is discrete, then the
equalities obtained in this section for confidence intervals and tests
need to be replaced by inequalities. Thus (10), for example, becomes

$$P(x_r \le \nu \le x_s) \ge \sum_{i=r}^{s-1} \binom{n}{i} \left(\tfrac12\right)^n \tag{17}$$

The reason for the inequality is in the fact that certain observations
may be duplicated. Thus suppose one wished to estimate ν for a
discrete distribution using a sample of size 6 and a 78 per cent
confidence interval given by x₂ and x₅. Now and then the two smallest
observations x₁ and x₂ will be equal so that the (x₂, x₅) interval is
equivalent to the (x₁, x₅) interval and hence corresponds to a
probability larger than .78. The same thing may happen at the upper
limit; x₅ and x₆ may be equal so that sometimes the (x₂, x₅) interval
is equivalent to the (x₂, x₆) interval; occasionally it can even be the
same as the (x₁, x₆) interval and thus correspond to the 97 per cent
rather than the 78 per cent level.

16.4. Comparison of Two Populations. A great many distribution- 
free methods have been developed for testing whether two populations 



have the same distribution. We shall consider only two of them, and 
at the end of this section we shall derive a confidence interval for the 
difference between two population medians. First, we shall obtain 
a simple result on the distribution of arrangements of two sets of 
observations from the same population. 

Let x₁, x₂, ···, $x_{n_1}$ be an ordered sample from a population with a
density function f(x), and let y₁, y₂, ···, $y_{n_2}$ be a second ordered
sample from the same population. Let the two samples be combined
and arranged in order of magnitude; thus, for example, one might have

$$y_1,\ x_1,\ x_2,\ y_2,\ x_3,\ y_3,\ y_4,\ y_5,\ x_4,\ \cdots \tag{1}$$

We wish to find the probability of obtaining a specific arrangement of
this kind.

If the x's are transformed to u's by the relation (2.3), and the y's
transformed to v's by the same relation, the joint density function
of the u's and v's is

$$g(u_1, u_2, \cdots, u_{n_1}, v_1, v_2, \cdots, v_{n_2}) = n_1!\, n_2! \tag{2}$$

The probability of a given arrangement such as (1) is found by
integrating (2) over the region defined by

$$0 < v_1 < u_1 < u_2 < v_2 < u_3 < \cdots < 1 \tag{3}$$

i.e., v₁ is integrated from zero to u₁, then u₁ from zero to u₂, etc. It is
readily seen that the value of the integral is n₁!n₂!/(n₁ + n₂)!, or simply
$1 \big/ \binom{n_1+n_2}{n_1}$. Since there are exactly $\binom{n_1+n_2}{n_1}$ arrangements of
n₁ x's and n₂ y's, it follows that all arrangements of the x's and y's are
equally likely.

Run Test. We now turn to the question of testing the null
hypothesis that two samples have come from the same population. The
observations in the two samples will be denoted by x's and y's as above.
The two sets of observations are combined as in (1) and the number d
of runs counted. A run is a sequence of letters of the same kind
bounded by letters of the other kind. Thus (1) starts with a run of
one y; then follows a run of two x's, then a run of one y, and so on;
six runs are exhibited in (1). It is apparent that if the two samples
are from the same population, the x's and y's will ordinarily be well
mixed and d will be large. If the two populations are widely
separated so that their ranges do not overlap, d will be only two, and, in
general, differences between the two populations will tend to reduce d.
Thus the two populations may have the same mean or median, but
if the x population is concentrated while the y population is dispersed,
there will be a long y run on each end of the combined sample and
there will thus be a tendency to reduce d. The test then is performed
by observing the total number of runs in the combined sample,
accepting the null hypothesis if d is greater than some specified number d₀,
or rejecting the null hypothesis if d ≤ d₀. Our task now is to
determine the distribution of d under the null hypothesis in order that we
may specify d₀ for a given level of significance.


We have seen that all of the $\binom{n_1+n_2}{n_1}$ arrangements of n₁ x's and
n₂ y's are equally likely under the null hypothesis. It is necessary
now to count all arrangements with exactly d runs. Suppose d is
even, say 2k; then there must be k runs of x's and k runs of y's. To
get k runs of x's, the n₁ x's must be divided into k groups, and we wish
to count all permutations of the k numbers giving the sizes of the
groups. In short, we wish to count all the ordered k-part partitions of
n₁ with zero parts excluded. This is readily done with the aid of the
generating function described in Sec. 2.6 for enumerating the ways of
getting a given total with a set of dice. The required number is the
coefficient of $t^{n_1}$ in

$$(t + t^2 + t^3 + \cdots)^k = \left(\frac{t}{1-t}\right)^k \tag{4}$$

$$= t^k \sum_{j=0}^{\infty} \binom{k-1+j}{k-1}\, t^j \tag{5}$$

which is $\binom{n_1-1}{k-1}$. Similarly there are $\binom{n_2-1}{k-1}$ k-part partitions of
n₂, excluding zero parts. Any partition of the x's may be combined
with any partition of the y's in two ways to form a sequence like (1);
the first x partition or the first y partition may be put at the beginning
of the sequence. Thus we have found the density for even values of
d:

$$h(d) = \frac{2\binom{n_1-1}{k-1}\binom{n_2-1}{k-1}}{\binom{n_1+n_2}{n_1}} \qquad d = 2k \tag{6}$$

and by similar reasoning one finds for odd values of d:

$$h(d) = \frac{\binom{n_1-1}{k}\binom{n_2-1}{k-1} + \binom{n_1-1}{k-1}\binom{n_2-1}{k}}{\binom{n_1+n_2}{n_1}} \qquad d = 2k+1 \tag{7}$$

To test the null hypothesis in question with a probability p for the
Type I error, one finds the integer d₀ so that (as nearly as possible)

$$\sum_{d=2}^{d_0} h(d) = p \tag{8}$$

and rejects the null hypothesis if the observed d does not exceed d₀.
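
The exact distribution of the number of runs is simple to tabulate. The sketch below is an illustration, not part of the original text. It uses the standard closed forms for the run distribution (twice the product of the two partition counts for even d, and the sum of two cross products for odd d, each divided by the number of arrangements). The function names are hypothetical, and `critical_runs` takes the conservative reading of "as nearly as possible" by returning the largest d₀ whose cumulative probability does not exceed p.

```python
from math import comb


def run_count(labels):
    """Number of runs in a sequence of labels such as 'yxxyxyyyx'."""
    labels = list(labels)
    if not labels:
        return 0
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def run_pmf(d, n1, n2):
    """Probability of exactly d runs among n1 x's and n2 y's when all
    arrangements are equally likely (n1, n2 >= 1)."""
    if d < 2 or d > n1 + n2:
        return 0.0
    total = comb(n1 + n2, n1)
    if d % 2 == 0:
        k = d // 2
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    else:
        k = (d - 1) // 2
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    return ways / total


def critical_runs(n1, n2, p):
    """Largest d0 with P(d <= d0) <= p, or None if no such d0 exists."""
    cumulative = 0.0
    d0 = None
    for d in range(2, n1 + n2 + 1):
        cumulative += run_pmf(d, n1, n2)
        if cumulative > p:
            break
        d0 = d
    return d0
```

One would reject the null hypothesis when `run_count` of the combined ordered sample does not exceed the value returned by `critical_runs`.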

The computation involved in (8) can become quite tedious unless
both n₁ and n₂ are small. The distribution of d becomes
approximately normal for large samples, and in fact the approximation is
usually good enough for practical purposes when both n₁ and n₂
exceed 10. The mean and variance of h(d) are

$$E(d) = \frac{2 n_1 n_2}{n_1 + n_2} + 1 \tag{9}$$

$$\sigma_d^2 = \frac{2 n_1 n_2\,(2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2\,(n_1 + n_2 - 1)} \tag{10}$$

and if we let

$$n_1 + n_2 = n \qquad n_1 = n\alpha \qquad n_2 = n\beta \tag{11}$$

these moments become, for large n, approximately

$$E(d) \cong 2n\alpha\beta \tag{12}$$

$$\sigma_d^2 \cong 4n\alpha^2\beta^2 \tag{13}$$

The large-sample normality of h(d) is demonstrated by using Stirling's
formula to evaluate the factorials in (6), substituting for d in terms of
t defined by

$$t = \frac{d - 2n\alpha\beta}{2\alpha\beta\sqrt{n}} \tag{14}$$

and showing that the logarithm of the resulting expression approaches

$$-\log\sqrt{2\pi} - \tfrac12 t^2$$

as n becomes infinite. We shall omit the details. Using this result,
one would determine d₀ for testing the null hypothesis at the .05
level, for example, by putting the right-hand side of (14) equal to
−1.645 and solving for d.

The run test is sensitive to both differences in shape and differences 
in location between two distributions. Often, however, in practice, 
one does not care about differences in shape; he is concerned only with 
location. That is to say, he would like to test merely the null
hypothesis that the population medians are equal: ν₁ = ν₂. It is not possible
to make such a test, but the following test of f₁(x) = f₂(x) is sensitive
primarily to differences in location and very little to differences in
shape.

Median Test. As before, let there be an ordered sample x₁, x₂,
···, $x_{n_1}$ from f₁(x) and a sample y₁, y₂, ···, $y_{n_2}$ from f₂(y). Let
z₁, z₂, ···, $z_{n_1+n_2}$ be the ordered combined sample. The test of the
null hypothesis f₁(x) = f₂(x) = f(x) will consist in finding the median
z̃ of the combined sample, then counting the number of x's, say m₁,
which exceed z̃ and the number of y's, say m₂, which exceed z̃. If the
null hypothesis is true, we should expect m₁ to be approximately n₁/2
and m₂ approximately n₂/2. It is clear that this test will be sensitive
to differences in location between f₁(x) and f₂(x) but not to differences
in their shape. Thus if f₁(x) and f₂(x) have the same median, we
should expect the null hypothesis to be accepted ordinarily even though
their shapes were quite different.

To make this test, the distribution of m₁ and m₂ under the null
hypothesis is required. Let zₐ be the ath observation in order of
magnitude, let m₁ be the number of x's which exceed zₐ, and let m₂ be the
number of y's which exceed zₐ. The joint density function of m₁, m₂,
zₐ under the null hypothesis is

$$\begin{aligned}
&\frac{n_1!}{(n_1-m_1-1)!\,1!\,m_1!}\,[F(z_a)]^{n_1-m_1-1}[1-F(z_a)]^{m_1}\,dF(z_a)\;\binom{n_2}{m_2}[F(z_a)]^{n_2-m_2}[1-F(z_a)]^{m_2} \\
&\quad + \binom{n_1}{m_1}[F(z_a)]^{n_1-m_1}[1-F(z_a)]^{m_1}\;\frac{n_2!}{(n_2-m_2-1)!\,1!\,m_2!}\,[F(z_a)]^{n_2-m_2-1}[1-F(z_a)]^{m_2}\,dF(z_a)
\end{aligned} \tag{15}$$

where the first term takes account of the case in which zₐ is an x
observation and the second term of that in which zₐ is a y observation; F(x)
is the cumulative form of f(x), and dF(zₐ) represents f(zₐ)Δzₐ. On
integrating out zₐ and combining the two resulting terms, one finds
the frequency function for m₁ and m₂ to be, say,

$$g(m_1, m_2) = \frac{\binom{n_1}{m_1}\binom{n_2}{m_2}}{\binom{n_1+n_2}{a}} \tag{16}$$

We observe, on comparing this expression with equation (12.10.17),
that it is just the distribution of the cell frequencies in a 2 × 2
contingency table with all marginal totals fixed when there is
independence. The contingency table is

    m₁        | m₂        | n₁ + n₂ − a
    n₁ − m₁   | n₂ − m₂   | a
    ----------+-----------+
    n₁        | n₂

where the marginal totals are shown to the right of and below the
closed part of the table. If n₁ + n₂ were odd, one would choose
a = (n₁ + n₂ + 1)/2, whereas if the sum were even, one would choose
a = (n₁ + n₂)/2. Thus the null hypothesis may be tested by using
either the criterion given by (12.10.8) or the chi-square criterion
given by (12.10.20). If n₁ + n₂ were small, one would use (16) to
compute the exact probabilities instead of using the approximate
probability given by the chi-square distribution with one degree of freedom.
The approximation is fairly good if both n₁ and n₂ exceed 10.
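
For small samples the exact probabilities of the median test can be computed from the fixed-margin 2 × 2 table. The sketch below is illustrative only; the function names are hypothetical, ties are assumed absent (continuous variates), and the probability used is the usual hypergeometric form: the product of the two binomial coefficients for the columns divided by the coefficient for the row total.

```python
from math import comb


def median_test_counts(xs, ys):
    """Return (m1, m2, a): the numbers of x's and y's exceeding z_a, the
    ath smallest observation of the combined sample, with a chosen as in
    the text ((n+1)/2 for odd n, n/2 for even n)."""
    combined = sorted(list(xs) + list(ys))
    n = len(combined)
    a = (n + 1) // 2 if n % 2 == 1 else n // 2
    z_a = combined[a - 1]
    m1 = sum(1 for x in xs if x > z_a)
    m2 = sum(1 for y in ys if y > z_a)
    return m1, m2, a


def median_test_pmf(m1, n1, n2, a):
    """Probability of m1 x's above z_a; m2 is then fixed at n1 + n2 - a - m1."""
    m2 = n1 + n2 - a - m1
    if m1 < 0 or m1 > n1 or m2 < 0 or m2 > n2:
        return 0.0
    return comb(n1, m1) * comb(n2, m2) / comb(n1 + n2, a)
```

Summing `median_test_pmf` over the values of m₁ at least as extreme as the one observed gives an exact significance level without appeal to the chi-square approximation.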
Confidence Intervals. In order to obtain exact confidence intervals
for the difference between the medians of two populations, it is
necessary to assume that the distributions differ only in location. Letting
x₁, x₂, ···, $x_{n_1}$ be a sample from a population with median ν₁, and
y₁, y₂, ···, $y_{n_2}$ a sample from one with median ν₂, we assume that the
variates

$$u_i = x_i - \nu_1 \qquad \text{and} \qquad v_i = y_i - \nu_2$$

have the same density function, say f(u), with median zero. The
sample of u's and the sample of v's are then two samples from the same
population. If one chooses two integers r and s, he may compute the
probability that uᵣ exceeds vₛ as follows:

$$P(u_r > v_s) = \sum_{i=0}^{r-1} \int_0^1 \binom{n_1}{i} [F(v_s)]^i\, [1 - F(v_s)]^{n_1-i}\; \frac{n_2!}{(s-1)!\,(n_2-s)!}\, [F(v_s)]^{s-1}\, [1 - F(v_s)]^{n_2-s}\, dF(v_s) \tag{17}$$

$$= \sum_{i=0}^{r-1} \binom{n_1}{i}\, \frac{n_2!}{(s-1)!\,(n_2-s)!} \cdot \frac{(s+i-1)!\,(n_1+n_2-s-i)!}{(n_1+n_2)!} \tag{18}$$

$$= \sum_{i=0}^{r-1} \frac{\binom{s+i-1}{i}\binom{n_1+n_2-s-i}{n_2-s}}{\binom{n_1+n_2}{n_1}} \tag{19}$$

Similarly

$$P(u_{r'} < v_{s'}) = \sum_{i=r'}^{n_1} \frac{\binom{s'+i-1}{i}\binom{n_1+n_2-s'-i}{n_2-s'}}{\binom{n_1+n_2}{n_1}} \tag{20}$$

If we now suppose r < s, r’ > s', and vs > vı, then 


$$P(y_{s'} - x_{r'} < \nu_2 - \nu_1 < y_s - x_r) = P(u_r < v_s \text{ and } u_{r'} > v_{s'}) \tag{21}$$

$$= 1 - P(u_r > v_s) - P(u_{r'} < v_{s'}) \tag{22}$$

and the left-hand side of this relation provides a confidence interval
for ν₂ − ν₁ with a confidence level which is calculable by means of (19)
and (20). The confidence interval provides a third test of the null
hypothesis that the two distributions are the same; the hypothesis
would be rejected if the interval did not include zero.
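
The confidence level in (22) can be evaluated with exact integer arithmetic. The sketch below is an illustration, not part of the original text; the function names are hypothetical. It rests on the fact that the number i of u's falling below the sth smallest v has probability C(s+i−1, i)·C(n₁+n₂−s−i, n₂−s)/C(n₁+n₂, n₁), which is the summand the text uses in (19) and (20).

```python
from math import comb


def _count_pmf_numerator(i, s, n1, n2):
    """Number of arrangements in which exactly i u's fall below v_s."""
    return comb(s + i - 1, i) * comb(n1 + n2 - s - i, n2 - s)


def prob_u_r_exceeds_v_s(r, s, n1, n2):
    """P(u_r > v_s): fewer than r of the u's lie below v_s."""
    total = comb(n1 + n2, n1)
    return sum(_count_pmf_numerator(i, s, n1, n2) for i in range(r)) / total


def prob_u_r_below_v_s(r, s, n1, n2):
    """P(u_r < v_s): at least r of the u's lie below v_s."""
    total = comb(n1 + n2, n1)
    return sum(_count_pmf_numerator(i, s, n1, n2) for i in range(r, n1 + 1)) / total


def difference_interval_level(r, s, r_prime, s_prime, n1, n2):
    """Confidence level 1 - P(u_r > v_s) - P(u_r' < v_s') as in (22)."""
    return (1.0 - prob_u_r_exceeds_v_s(r, s, n1, n2)
            - prob_u_r_below_v_s(r_prime, s_prime, n1, n2))
```

As a check on a case small enough to enumerate by hand, with two observations in each sample the probability that the smaller u exceeds the larger v is 1/6, since only one of the six equally likely arrangements has both v's first.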

We shall outline a large-sample approximation which may be used
when n₁ and n₂ both exceed 10. Since the sum expressed in (19) is
one when taken over the whole range of i, we may regard the summand
as a density function for a variate i and find the normal
approximation to that function. The sum may then be approximated by
integrating the approximating function. The mean and variance of i are

$$E(i) = \frac{s\, n_1}{n_2 + 1} \tag{23}$$

$$\sigma_i^2 = \frac{s\, n_1}{n_2 + 1}\left[\frac{(s+1)(n_1-1)}{n_2+2} + 1 - \frac{s\, n_1}{n_2+1}\right] \tag{24}$$

and their approximate values when n₁ and n₂ are large may be found
by letting

$$n_1 + n_2 = n \qquad n_1 = n\alpha \qquad n_2 = n\beta \qquad s = \gamma n_2 = \beta\gamma n \tag{25}$$

and keeping only terms involving the highest power of n. The results
are

$$E(i) \cong n\alpha\gamma \tag{26}$$

$$\sigma_i^2 \cong \frac{n\alpha\gamma(1-\gamma)}{\beta} \tag{27}$$

The large-sample normality of i may be proved in the same manner as
outlined for d in (6). The sum in (19) may then be approximated by

$$\int_{-\infty}^{\lambda} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \tag{28}$$

where

$$\lambda = \frac{\left(r - 1 + \tfrac12\right) - n\alpha\gamma}{\sqrt{n\alpha\gamma(1-\gamma)/\beta}} \tag{29}$$
Given s, one would choose λ to give the desired probability level
(−1.96, for example, to make the probability .025) and solve for r.
The question arises as to how s should be chosen. Clearly s should
be greater than n₂/2 and r should be less than n₁/2. One might, for
example, make the two differences equal, but a shorter confidence
interval may be expected by making the differences equal on a
“standard” scale. The number of x observations less than ν₁ is
approximately normally distributed with mean n₁/2 and standard deviation
√n₁/2; similarly the number of y observations exceeding ν₂ is
approximately normally distributed with mean n₂/2 and standard deviation
√n₂/2. We shall then determine s so that

$$\frac{n_1/2 - r}{\sqrt{n_1}/2} = \frac{s - n_2/2}{\sqrt{n_2}/2} \tag{30}$$

If one substitutes for n₁, n₂, and s in this relation in terms of (25) and
solves for r, then equates the result to the solution of (29) for r, he
finds

$$\gamma \cong \frac12 - \frac{\lambda}{2\sqrt{n}\,(\sqrt{\alpha\beta} + \beta)} \tag{31}$$

neglecting terms with higher powers of 1/√n; in terms of the original
symbols this becomes

$$s \cong \frac{n_2}{2} - \frac{\lambda\sqrt{n_2}\,\sqrt{n_1 + n_2}}{2(\sqrt{n_1} + \sqrt{n_2})} \tag{32}$$

and from (30)

$$r \cong \frac{n_1}{2} + \frac{\lambda\sqrt{n_1}\,\sqrt{n_1 + n_2}}{2(\sqrt{n_1} + \sqrt{n_2})} \tag{33}$$

In similar fashion one would argue that good choices for r′ and s′ in
(22) are given by changing the signs in (32) and (33).
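
The large-sample choice of r and s can be sketched as follows. This is an illustration under stated assumptions, not the text's own computation: it takes s to lie above n₂/2 and r below n₁/2 by the same number of standard deviations (√n₂/2 and √n₁/2 respectively), that common number being −λ√(n₁+n₂)/(√n₁+√n₂) for a negative normal deviate λ. The function name is hypothetical, and the results are left unrounded; in practice one would round to integers.

```python
import math


def large_sample_ranks(n1, n2, lam):
    """Approximate (r, s) for a negative normal deviate lam (for example
    -1.96), with r below n1/2 and s above n2/2 by equal amounts on the
    standard scale."""
    shift = lam * math.sqrt(n1 + n2) / (2.0 * (math.sqrt(n1) + math.sqrt(n2)))
    r = n1 / 2.0 + math.sqrt(n1) * shift
    s = n2 / 2.0 - math.sqrt(n2) * shift
    return r, s
```

Changing the sign of `lam` gives the corresponding pair on the other side of the two medians, which is how the text proposes to obtain r′ and s′.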

16.5. A Distribution-free Test for One-factor Experiments.* A 
factor is tested at k levels with nᵢ observations at the ith level; the
observations may be denoted by $x_{ij}$ with i = 1, 2, ···, k and
j = 1, 2, ···, nᵢ. The null hypothesis that the factor has no effect
will be tested by testing in fact whether all the n = Σnᵢ observations
may be regarded as coming from the same population. Ordinarily in 
practice one is not much concerned with whether the cell distributions 
differ in shape; he is primarily concerned with whether they differ in 
location. Hence the test we shall consider will be a generalization 
of the second test given in the preceding section. 

Let $m_i$ be the number of observations in the $i$th cell which exceed
the median of the whole set of $n$ observations and construct the contingency table:

$$\begin{array}{cccc|c}
m_1 & m_2 & \cdots & m_k & a \\
n_1 - m_1 & n_2 - m_2 & \cdots & n_k - m_k & n - a \\
\hline
n_1 & n_2 & \cdots & n_k & n
\end{array}$$

where $a = n/2$ if $n$ is even or $(n - 1)/2$ if $n$ is odd. It is easily shown
by the argument used in the preceding section that the density
function for the $m_i$ is

$$g(m_1, m_2, \ldots, m_k) = \frac{\prod_{i=1}^{k} \binom{n_i}{m_i}}{\binom{n}{a}} \qquad (1)$$

This is just the ordinary distribution for a $2 \times k$ contingency table with
all marginal totals fixed when there is independence. Hence the null
hypothesis may be tested by means of the $\lambda$ criterion or the chi-square
criterion of Sec. 12.10. The chi-square criterion is ordinarily easier
to use, and using the present notation, it may be put in the form

$$\chi^2 = \frac{n(n - 1)}{a(n - a)} \sum_{i=1}^{k} \frac{1}{n_i}\left(m_i - \frac{n_i a}{n}\right)^2 \qquad (2)$$

*Sections 16.5 through 16.9 are based in part on unpublished work of George
W. Brown (see Preface).

where we have retained a factor $n - 1$ in the numerator of the coefficient
instead of replacing it by $n$, as is usually done when $n$ is
assumed large. This chi-square criterion has a distribution very accurately
approximated by the chi-square distribution with $k - 1$ degrees of
freedom even if $n$ is only of the order of twenty provided all the $n_i$ are
at least five. For smaller values of the $n$'s one should compute exact
probabilities from (1).

To estimate the difference between main effects of the factor at two 
levels, one would use the difference of the cell medians, and a con- 
fidence interval for the difference would be constructed by the method 
described in the preceding section. 
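
As a rough illustration, the chi-square criterion of this section can be computed as sketched below. The function name `median_test` and the list-of-lists input are assumptions made for the example, and no observation is assumed to tie with the over-all median, as is proper for continuous variates.

```python
from statistics import median

def median_test(groups):
    """Chi-square criterion of Sec. 16.5 for k groups of observations.

    Returns the counts m_i above the over-all median, the total a,
    the criterion n(n-1)/[a(n-a)] * sum (m_i - n_i a/n)^2 / n_i,
    and its degrees of freedom k - 1.
    """
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    overall_median = median(pooled)
    a = n // 2  # n/2 if n is even, (n - 1)/2 if n is odd
    m = [sum(x > overall_median for x in g) for g in groups]
    coeff = n * (n - 1) / (a * (n - a))
    chi2 = coeff * sum((mi - len(g) * a / n) ** 2 / len(g)
                       for mi, g in zip(m, groups))
    return m, a, chi2, len(groups) - 1

# Two completely separated groups of five observations each.
counts, a, chi2, df = median_test([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
```

The value of `chi2` would be referred to a chi-square table with `df` degrees of freedom when all the group sizes are at least five.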

16.6. Two-factor Experiments, One Observation per Cell. The
observations are denoted by $x_{ij}$ with $i = 1, 2, \ldots, r$ and
$j = 1, 2, \ldots, c$. The row factor is thus being tested at $r$ levels and the column factor
at $c$ levels. The distributions of the $x_{ij}$ have medians

$$\nu_{ij} = \nu + \alpha_i + \beta_j \qquad (1)$$

where the median of the numbers $\alpha_i$ is zero as is the median of the $\beta_j$.
The $\alpha_i$ and $\beta_j$ are identified with row and column effects. The distributions
of the $x_{ij}$ are assumed to be identical except for location;
thus the variates $x_{ij} - \nu_{ij}$ are all supposed to have the same density,
say $f(u)$. Also the $x$'s are assumed to be continuous variates. If one
or both the factors have randomly chosen levels, we may suppose that
the density takes account of random interaction effects as well as
error effects. Otherwise it is necessary to assume the interactions are
zero.

We shall examine the null hypothesis that the row effects, $\alpha_i$, are
zero. Under this hypothesis all the observations in a given column
have the same distribution. Let $\tilde{x}_j$ be the median of the observations
in the $j$th column, and in the two-way table let the observation $x_{ij}$
be replaced by a plus sign if it exceeds $\tilde{x}_j$ or by a minus sign if it does
not. The $r \times c$ table then consists of plus and minus signs in equal
number if $r$ is even, or with $c$ more minus signs than plus signs if $r$ is
odd. Let $m_i$ be the number of plus signs in the $i$th row. If there are
in fact no row effects, then we should expect the $m_i$ to differ from $c/2$
only by random sampling deviations, but if there are row effects, then
the rows with positive effects would have an excess of plus signs while
those rows with a negative effect would have a deficiency of plus signs.
The null hypothesis is therefore tested by testing whether the signs are
divided evenly in rows. In fact, we may construct a $2 \times r$ contingency table:

$$\begin{array}{c|cccc|c}
+ & m_1 & m_2 & \cdots & m_r & ca \\
- & c - m_1 & c - m_2 & \cdots & c - m_r & c(r - a)
\end{array}$$

where $a = r/2$ if $r$ is even or $(r - 1)/2$ if $r$ is odd. It turns out that
the $m_i$ do not have the ordinary contingency-table distribution as
was the case in the preceding section. However the distribution of
the $m_i$ is such that the large-sample distribution of an analogous chi-square
criterion has, in fact, the chi-square distribution with $r - 1$
degrees of freedom. So this table may be tested like an ordinary
contingency table with all marginal totals fixed.

The distribution of the $m_i$ is best exhibited in the form of a generating
function; the distribution itself does not have a simple closed form.
Suppose we let $t_1$ be associated with a plus sign in the first row, $t_2$ with
a plus sign in the second row, and so forth. Let $\phi_a(t_1, t_2, \ldots, t_r)$
consist of the sum of all terms that can be formed by multiplying the
$t$'s together $a$ at a time. Thus, for example,

$$\phi_2(t_1, t_2, t_3, t_4) = t_1t_2 + t_1t_3 + t_1t_4 + t_2t_3 + t_2t_4 + t_3t_4 \qquad (2)$$

Each term of $\phi_a(t_1, \ldots, t_r)$ describes a possible arrangement of signs
in a given column. Furthermore it is easily argued that each arrangement
of signs is equally likely; hence the probability of a particular
arrangement is $1/\binom{r}{a}$. Now we consider the function

$$\phi = \left[\frac{\phi_a(t_1, t_2, \ldots, t_r)}{\binom{r}{a}}\right]^c \qquad (3)$$

A little reflection will convince one that there is a one-to-one correspondence
between ways of getting terms $t_1^{m_1} t_2^{m_2} \cdots t_r^{m_r}$ in the
numerator of $\phi$ and arrangements of signs in the $r \times c$ table which
give rise to $m_1, m_2, \ldots, m_r$ plus signs in the respective rows. Hence

$$\phi = \sum_{m_1} \sum_{m_2} \cdots \sum_{m_r} g(m_1, m_2, \ldots, m_r)\, t_1^{m_1} t_2^{m_2} \cdots t_r^{m_r} \qquad (4)$$



where $g$ is the density function for the $m_i$. On putting all the $t_i = 1$,
$\phi$ becomes one, since the sum in (4) is then just the sum of the density
over its whole space; this is evident from (3) also, since

$$\phi_a(1, 1, \ldots, 1) = \binom{r}{a}$$

there being $\binom{r}{a}$ terms in $\phi_a(t_1, t_2, \ldots, t_r)$.


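It is evident from (4) that $\phi$ is a factorial-moment generating function
for the $m_i$. Thus,

$$E(m_1) = \frac{\partial \phi}{\partial t_1} \quad \text{with all } t_i = 1 \qquad (5)$$

$$= \frac{c\,\phi_{a-1}(t_2, t_3, \ldots, t_r)\,[\phi_a(t_1, t_2, \ldots, t_r)]^{c-1}}{\binom{r}{a}^{c}} \quad \text{at } t_i = 1 \qquad (6)$$

$$= \frac{c\binom{r-1}{a-1}}{\binom{r}{a}} \qquad (7)$$

$$= \frac{ca}{r} \qquad (8)$$

which is the same for all $m_i$, and similarly the variances and covariances
of the $m_i$ are found to be

$$\sigma_{ii} = \frac{ca(r - a)}{r^2} \qquad (9)$$

$$\sigma_{ij} = -\frac{ca(r - a)}{r^2(r - 1)} \qquad i \neq j \qquad (10)$$

Taking $m_r$ to be the dependent variate (they are related by $\sum m_i = ca$),
the matrix of variances and covariances for $m_1, m_2, \ldots, m_{r-1}$ may
be inverted to get

$$\sigma^{ii} = \frac{2r(r - 1)}{ca(r - a)} \qquad (11)$$

$$\sigma^{ij} = \frac{r(r - 1)}{ca(r - a)} \qquad i \neq j \qquad (12)$$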
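
The moments quoted above can be checked numerically for small tables by listing the equally likely arrangements of plus signs in one column. The sketch below does this with exact fractions; the function name `plus_sign_moments` is an assumption made for the example, and it relies only on the independence of the $c$ columns.

```python
from fractions import Fraction
from itertools import combinations

def plus_sign_moments(r, a, c):
    """Exact mean, variance, and covariance of the row counts m_i.

    Each of the c independent columns places plus signs in a of its r
    rows, every choice of rows being equally likely.
    """
    arrangements = list(combinations(range(r), a))
    total = len(arrangements)
    # Probability of a plus sign in row 0, and in rows 0 and 1 together.
    p_one = Fraction(sum(1 for s in arrangements if 0 in s), total)
    p_two = Fraction(sum(1 for s in arrangements if 0 in s and 1 in s), total)
    mean = c * p_one
    variance = c * p_one * (1 - p_one)
    covariance = c * (p_two - p_one * p_one)
    return mean, variance, covariance

r, a, c = 5, 2, 3
mean, variance, covariance = plus_sign_moments(r, a, c)
```

For these values the enumeration agrees with $ca/r$, $ca(r - a)/r^2$, and $-ca(r - a)/[r^2(r - 1)]$.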


We shall not demonstrate that the $m_i$ are asymptotically normally
distributed. The simplest proof appeals to a generalization of the
central-limit theorem. If variates $y_1, y_2, \ldots, y_{r-1}$ are distributed
with finite variances and covariances, $\sigma_{ij}$, then it can be shown that
the averages $\bar{y}_i$ for a large sample of size $c$ are approximately normally
distributed with variances and covariances $\sigma_{ij}/c$. In the present
instance, $y_1$ would be defined to be one or zero according as there was
or was not a plus sign in the first cell of a column, and similarly for the
other $y$'s. The $c$ columns are then regarded as $c$ observations on the
$y_i$, and the $\bar{y}_i$ are then $m_i/c$, which by the general theorem must be
asymptotically normal as $c$ becomes large.

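A more direct proof of normality for large $c$ could be constructed by
replacing the $t_i$ in $\phi$ by $e^{\theta_i/\sqrt{c}}$ (except $t_r = 1$), then showing that $\log \phi$
approaches, as $c$ becomes large, the expression

$$\sum_{i=1}^{r-1} \theta_i \frac{\sqrt{c}\,a}{r} + \frac{1}{2} \sum_{i=1}^{r-1} \sum_{j=1}^{r-1} \frac{\sigma_{ij}}{c}\,\theta_i \theta_j \qquad (13)$$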


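the exponent of $e$ in the moment generating function of a normal distribution,
as shown by equation (9.5.4). The quadratic form of the
large-sample normal distribution will have the chi-square distribution
with $r - 1$ degrees of freedom; the quadratic form is

$$\sum_{i=1}^{r-1} \sum_{j=1}^{r-1} \sigma^{ij}\left(m_i - \frac{ca}{r}\right)\left(m_j - \frac{ca}{r}\right) \qquad (14)$$

and it may be reduced to the expression

$$\chi^2 = \frac{r(r - 1)}{ca(r - a)} \sum_{i=1}^{r} \left(m_i - \frac{ca}{r}\right)^2 \qquad (15)$$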


The ordinary chi-square criterion given by equation (12.10.20), if
applied to the $2 \times r$ table at the beginning of this section, would differ
from (15) only in that the numerator of the coefficient of the sum
would be $r^2$ instead of $r(r - 1)$. Here $r$ is not assumed to be large
so the difference may be appreciable.

The null hypothesis that the row effects are zero may therefore be
tested by the criterion (15), using the ordinary chi-square distribution
unless $c$ is small. For practical purposes the large-sample distribution
is satisfactory if $c$ is as large as 10, or even if $c$ is only 5 provided $rc$ is 20
or more; for smaller values the exact probability should be computed
by means of (3). To test column effects, one would, of course, simply
reverse the roles of rows and columns in the above test.
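
The test of row effects can be sketched in a few lines. The function name `row_effect_test` and the nested-list layout are assumptions made for the example; the criterion is taken in the form $r(r - 1)/[ca(r - a)]$ times the sum of squared deviations of the $m_i$ from $ca/r$.

```python
from statistics import median

def row_effect_test(table):
    """Row-effect criterion for an r x c table with one observation per cell.

    table[i][j] is the observation in row i and column j. Returns the
    plus-sign counts m_i, the chi-square criterion, and r - 1.
    """
    r, c = len(table), len(table[0])
    a = r // 2  # r/2 if r is even, (r - 1)/2 if r is odd
    column_medians = [median([row[j] for row in table]) for j in range(c)]
    m = [sum(row[j] > column_medians[j] for j in range(c)) for row in table]
    coeff = r * (r - 1) / (c * a * (r - a))
    chi2 = coeff * sum((mi - c * a / r) ** 2 for mi in m)
    return m, chi2, r - 1

# Four rows and three columns with a strong row effect.
table = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
m, chi2, df = row_effect_test(table)
```

To test column effects with the same function one would pass the transposed table.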

16.7. Two-factor Experiments, Several Observations per Cell.
We shall suppose that there are $r$ rows, $c$ columns, and $h$ observations
per cell. The observations are denoted by $x_{ijk}$ with $i = 1, 2, \ldots, r$;
$j = 1, 2, \ldots, c$; and $k = 1, 2, \ldots, h$. It is assumed that the
variates are continuous and have the same distribution except for
location; if the population cell medians are $\nu_{ij}$ then the variates
$x_{ijk} - \nu_{ij}$ all have the same distribution. The $\nu_{ij}$ may be put in the
form

$$\nu_{ij} = \nu + \alpha_i + \beta_j + \gamma_{ij} \qquad (1)$$

where the $\alpha_i$ have zero median, the $\beta_j$ have zero median, and the $\gamma_{ij}$
have zero medians in every row and column. If the levels of a factor
are randomly chosen, then the effects are random variables and are
regarded as having a zero population median rather than a zero
median themselves.

Main Effects against Interaction. To make the test analogous to the
test of main effects against interaction in the ordinary analysis of
variance, one simply finds the cell medians $\tilde{x}_{ij}$ (median of the $h$ observations
in the $i, j$ cell) and uses the tests presented in the preceding
section on these cell medians.

Joint Tests of Main Effects and Interaction. By using a procedure
similar to that of the preceding section it is possible to construct a
simple test of the hypothesis that a factor has no effect whatever, either
in main effects or in interaction effects. Thus we shall consider the
null hypothesis: $\alpha_i = 0$ and $\gamma_{ij} = 0$. Let $\tilde{x}_j$ represent the median of
all $rh$ observations in the $j$th column, and let $m_{ij}$ be the number of
observations in the $i, j$ cell which exceed $\tilde{x}_j$. Considering a specific
column, we have just the one-factor situation discussed in Sec. 5, and
the $m_{ij}$ have the density

$$\frac{\prod_{i=1}^{r} \binom{h}{m_{ij}}}{\binom{rh}{a}} \qquad (2)$$

where $a = rh/2$ or $(rh - 1)/2$ whichever is an integer. The density
for all the $m_{ij}$ is therefore obtained by taking the product of (2) over $j$
from one to $c$. We need not, however, deal with this distribution
except in the case of small numbers. To test the null hypothesis, one
would compute the chi square of equation (5.2) for each column and
add the results to obtain

$$\chi^2 = \frac{r(rh - 1)}{a(rh - a)} \sum_{i=1}^{r} \sum_{j=1}^{c} \left(m_{ij} - \frac{a}{r}\right)^2 \qquad (3)$$


which is approximately distributed as chi square with c(r — 1) degrees 
of freedom, and the approximation is satisfactory enough for most 
practical purposes if h is 5 or more, or if rch is 20 or more. 
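
A minimal sketch of the joint test follows, under the assumption that the data are held as `data[i][j]`, a list of the $h$ observations in cell $i, j$; the function name `joint_test` is likewise an assumption. The criterion is accumulated column by column, each column contributing the one-factor chi square of Sec. 5 with $n = rh$ and every $n_i = h$.

```python
from statistics import median

def joint_test(data):
    """Joint test of row main effects and interaction.

    data[i][j] is the list of h observations in row i, column j.
    Returns the chi-square criterion and c(r - 1) degrees of freedom.
    """
    r, c, h = len(data), len(data[0]), len(data[0][0])
    a = (r * h) // 2  # rh/2 or (rh - 1)/2, whichever is an integer
    coeff = r * (r * h - 1) / (a * (r * h - a))
    chi2 = 0.0
    for j in range(c):
        column = [x for i in range(r) for x in data[i][j]]
        column_median = median(column)
        for i in range(r):
            m_ij = sum(x > column_median for x in data[i][j])
            chi2 += coeff * (m_ij - a / r) ** 2
    return chi2, c * (r - 1)

# Two rows, one column, two observations per cell.
chi2, df = joint_test([[[1, 2]], [[3, 4]]])
```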

Main Effects against Deviations. The distribution-free test analogous
to the analysis-of-variance test of main effects against deviations
is fairly simple if interactions are assumed to be zero. Referring to
the numbers $m_{ij}$ of the paragraph above, let

$$\sum_j m_{ij} = n_i \qquad (4)$$

i.e., $n_i$ is the number of observations in the $i$th row which exceed their
column medians. There being $ch$ observations in a row, we should
expect the $n_i$ to be roughly $ch/2$ under the null hypothesis. The
hypothesis is tested by means of a chi-square criterion much like that
of the preceding section. We shall merely outline its derivation.
The $2 \times r$ contingency table here is:

$$\begin{array}{cccc|c}
n_1 & n_2 & \cdots & n_r & ca \\
ch - n_1 & ch - n_2 & \cdots & ch - n_r & c(rh - a)
\end{array}$$

but it does not have the ordinary contingency-table distribution.
A factorial-moment generating function for the ni is 


A 
b(t, tz, * + + , t) = coefficient of II? in | a (5) 
i s (nh 
a 
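Using this, one finds the means, variances, and covariances of the $n_i$
to be

$$E(n_i) = \frac{ca}{r} \qquad (6)$$

$$\sigma_{ii} = \frac{ca(r - 1)(rh - a)}{r^2(rh - 1)} \qquad (7)$$

$$\sigma_{ij} = -\frac{ca(rh - a)}{r^2(rh - 1)} \qquad i \neq j \qquad (8)$$

The inverse of the variance-covariance matrix for $i = 1, 2, \ldots, r - 1$
is found to be

$$\sigma^{ii} = \frac{2r(rh - 1)}{ca(rh - a)}, \qquad \sigma^{ij} = \frac{r(rh - 1)}{ca(rh - a)} \quad (i \neq j) \qquad (9)$$

and the chi-square criterion is

$$\chi^2 = \frac{r(rh - 1)}{ca(rh - a)} \sum_{i=1}^{r} \left(n_i - \frac{ca}{r}\right)^2 \qquad (10)$$

with $r - 1$ degrees of freedom.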
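
The row counts $n_i$ and the criterion with $r - 1$ degrees of freedom might be computed as in the sketch below. The data layout `data[i][j]` (a list of the $h$ observations in a cell) and the function name `main_effect_test` are assumptions made for the example, and the coefficient is taken to be $r(rh - 1)/[ca(rh - a)]$.

```python
from statistics import median

def main_effect_test(data):
    """Test of row main effects against deviations, interactions assumed zero.

    data[i][j] is the list of h observations in row i, column j.
    Returns the row counts n_i, the chi-square criterion, and r - 1.
    """
    r, c, h = len(data), len(data[0]), len(data[0][0])
    a = (r * h) // 2
    n = [0] * r
    for j in range(c):
        column_median = median([x for i in range(r) for x in data[i][j]])
        for i in range(r):
            # observations of row i exceeding the median of their column
            n[i] += sum(x > column_median for x in data[i][j])
    coeff = r * (r * h - 1) / (c * a * (r * h - a))
    chi2 = coeff * sum((ni - c * a / r) ** 2 for ni in n)
    return n, chi2, r - 1

data = [[[1, 2], [1, 2]], [[3, 4], [3, 4]]]
n, chi2, df = main_effect_test(data)
```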

Interaction. All the tests described thus far are computationally
quite simple, merely a matter of counting observations and computing
a chi square. To test the null hypothesis that the $\gamma_{ij}$ of (1) are
zero, it is necessary first to remove both row and column main effects
by an iterative reduction; then one proceeds with a test similar to
those already described.

Letting $\tilde{x}_j$ be the column medians as before, one removes the column
effects to a first approximation by subtracting the $\tilde{x}_j$ from the observations
of the $j$th column to get a reduced set of observations:

$$x'_{ijk} = x_{ijk} - \tilde{x}_j \qquad (11)$$

One then finds the row medians $\tilde{x}'_i$ and subtracts these out to get

$$x''_{ijk} = x'_{ijk} - \tilde{x}'_i \qquad (12)$$

If the plus and minus signs are balanced in the columns (they will
obviously be balanced in the rows), the reduction is complete. But
ordinarily the subtraction of the row medians will upset the balance of
signs in the columns, and it is necessary to find the column medians
$\tilde{x}''_j$ of the $x''_{ijk}$ and subtract these out to get

$$x'''_{ijk} = x''_{ijk} - \tilde{x}''_j \qquad (13)$$

This process is continued until both rows and columns have zero
medians. One could, of course, start the reduction with the row
medians of the original observations rather than the column medians.
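
The alternating reduction described above might be sketched as follows. The function name `median_reduce`, the iteration cap `max_iter`, and the tolerance `tol` are assumptions added for the example; the text itself gives no stopping rule beyond continuing until all row and column medians are zero.

```python
from statistics import median

def median_reduce(data, max_iter=100, tol=1e-12):
    """Alternately subtract column medians and row medians.

    data[i][j] is the list of h observations in row i, column j.
    Returns the reduced observations in the same layout.
    """
    r, c = len(data), len(data[0])
    x = [[list(cell) for cell in row] for row in data]
    for _ in range(max_iter):
        changed = False
        for j in range(c):  # remove column medians
            med = median([v for i in range(r) for v in x[i][j]])
            if abs(med) > tol:
                changed = True
                for i in range(r):
                    x[i][j] = [v - med for v in x[i][j]]
        for i in range(r):  # remove row medians
            med = median([v for j in range(c) for v in x[i][j]])
            if abs(med) > tol:
                changed = True
                for j in range(c):
                    x[i][j] = [v - med for v in x[i][j]]
        if not changed:  # every row and column median is already zero
            break
    return x

# Purely additive data: row effect 1, column effect 10, no interaction.
reduced = median_reduce([[[0], [10]], [[1], [11]]])
```

For this additive example every reduced observation is zero, as one would expect when there is no interaction.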

After the reduction is completed, one counts the number of plus
signs, $m_{ij}$, in each cell, counting the zeros as one-half plus and one-half
minus. The numbers $m_{ij}$ and $h - m_{ij}$ form a $2 \times r \times c$ contingency
table with all marginal totals fixed, and the null hypothesis may be
tested by the ordinary chi-square criterion for testing independence
in such a table. This interaction test is very nearly but not completely
distribution free. The approximate chi-square criterion is

$$\chi^2 = m^2 h \sum_{i,j} \frac{[m_{ij} - (m_{i\cdot}\, m_{\cdot j}/m)]^2}{m_{i\cdot}\, m_{\cdot j}\,(mh - m_{i\cdot}\, m_{\cdot j})} \qquad (14)$$

with $(r - 1)(c - 1)$ degrees of freedom where

$$m_{i\cdot} = \sum_j m_{ij} \qquad m_{\cdot j} = \sum_i m_{ij} \qquad m = \sum_{i,j} m_{ij} \qquad (15)$$

The expression simplifies somewhat if $m_{i\cdot} = ch/2$ and $m_{\cdot j} = rh/2$, but
this will not always be the case owing to the presence of zeros among
the reduced observations.

All the elements for testing in factorial experiments have been pre- 
sented in this discussion of the two-factor experiment. The methods 
carry over directly to more complicated situations. The general rule 
for testing one factor or set of factors assuming the presence of a second 
set of factors is to fit the second set of factors using medians, then 
classify the data according to the first set of factors and test for fifty- 
fifty splits between positive and negative deviations in the various 
classifications. All these tests are special cases of tests in the general 
linear regression problem which will be described briefly in Sec. 9. 

16.8. Simple Linear Regression. A continuous variate $x$ has a
density $f(x)$ whose median is of the form

$$\nu = \alpha + \beta z \qquad (1)$$

where $\alpha$ and $\beta$ are unknown parameters and $z$ is an observable parameter.
On the basis of a sample of $n$ observations, $(x_1, z_1), (x_2, z_2),
\ldots, (x_n, z_n)$, it is desired to estimate $\alpha$ and $\beta$ or test hypotheses
regarding $\alpha$ and $\beta$.

Point Estimation. Supposing the paired observations to be plotted
as $n$ points in the $z, x$ plane, the problem here is to fit a regression line
of the form

$$x = \alpha + \beta z \qquad (2)$$

to the plotted points. If we denote the estimates of $\alpha$ and $\beta$ by $\hat{\alpha}$ and
$\hat{\beta}$, the two conditions which determine $\hat{\alpha}$ and $\hat{\beta}$ are

$$\text{Median of } (x_i - \hat{\alpha} - \hat{\beta} z_i) = 0 \quad \text{for } z_i \le \tilde{z} \qquad (3)$$

$$\text{Median of } (x_i - \hat{\alpha} - \hat{\beta} z_i) = 0 \quad \text{for } z_i > \tilde{z} \qquad (4)$$

where $\tilde{z}$ is the median of the $z_i$. Thus one divides the observations
into two groups, using the median of the $z$'s, and chooses that line
which makes the median of the deviations zero in each group. (If it
happens that several $z$ values fall at $\tilde{z}$, then the $\le$ sign in (3) and $>$
sign in (4) would be replaced by $<$ and $\ge$ if such a replacement would
more nearly divide the points into groups of equal size.)



In practice, unless the number of observations is quite large, the
simplest method of determining the line is to plot the points and use a
transparent ruler to locate the line by eye. For machine work, the
following iterative procedure may be used: Find the medians of the
$x$'s and $z$'s in each of the two groups. The slope of the line joining
the two points determined by these four medians is a first approximation,
say $\beta'$, to $\hat{\beta}$. Let the deviations of the $x_i$ from the line, $x = \beta' z$,
be

$$x'_i = x_i - \beta' z_i \qquad (5)$$

A slope $\delta'$ is fitted to these deviations in the same manner as above to
get a correction to $\beta'$. The second approximation to $\hat{\beta}$ is

$$\beta'' = \beta' + \delta' \qquad (6)$$

Now new deviations

$$x''_i = x_i - \beta'' z_i = x'_i - \delta' z_i \qquad (7)$$

are computed and a slope $\delta''$ fitted to them. The third approximation
to $\hat{\beta}$ is

$$\beta''' = \beta'' + \delta'' \qquad (8)$$

and the iteration continues until $\hat{\beta}$ is determined to the desired degree
of accuracy. Then $\hat{\alpha}$ is the median of the final set of deviations.
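
The iterative procedure can be sketched as below. The function name `fit_median_line`, the iteration cap `max_iter`, and the tolerance `tol` are assumptions made for the example; the groups are formed with $z_i \le \tilde{z}$ and $z_i > \tilde{z}$, and both groups are assumed non-empty with distinct $z$ medians.

```python
from statistics import median

def fit_median_line(x, z, max_iter=100, tol=1e-12):
    """Fit x = alpha + beta*z by repeated slope corrections.

    At each step a slope is fitted to the current deviations through
    the two points (median z, median deviation) of the two groups.
    """
    z_med = median(z)
    left = [i for i in range(len(z)) if z[i] <= z_med]
    right = [i for i in range(len(z)) if z[i] > z_med]
    z_left = median([z[i] for i in left])
    z_right = median([z[i] for i in right])

    beta = 0.0
    deviations = list(x)  # deviations from the line with zero slope
    for _ in range(max_iter):
        d_left = median([deviations[i] for i in left])
        d_right = median([deviations[i] for i in right])
        delta = (d_right - d_left) / (z_right - z_left)  # slope correction
        beta += delta
        deviations = [xi - beta * zi for xi, zi in zip(x, z)]
        if abs(delta) < tol:
            break
    alpha = median(deviations)
    return alpha, beta

# Points lying exactly on x = 2 + 3z.
z = [0, 1, 2, 3, 4, 5]
x = [2 + 3 * zi for zi in z]
alpha, beta = fit_median_line(x, z)
```

The cap on the number of iterations is a precaution only, since the sketch makes no claim about how quickly the corrections settle down for irregular data.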

Tests of Hypotheses. To test the null hypothesis, $\alpha = \alpha_0$ and $\beta = \beta_0$,
one divides the points into two groups at $\tilde{z}$ and tests whether the two
groups are both evenly divided by the line. Let $m_1$ be the number of
points above the line for $z_i \le \tilde{z}$, and let $m_2$ be the number of points
above the line for $z_i > \tilde{z}$. Both $m_1$ and $m_2$ have the binomial distribution
with parameter one-half; hence

$$\chi^2 = \frac{8}{n}\left[\left(m_1 - \frac{n}{4}\right)^2 + \left(m_2 - \frac{n}{4}\right)^2\right] \qquad (9)$$

will have approximately the chi-square distribution with two degrees
of freedom, unless $n$ is small, in which case one would use the exact
distribution to compute the probability.

To test $\alpha = \alpha_0$ only, one would fit a line, $x = \alpha_0 + \hat{\beta} z$, to the points,
determining $\hat{\beta}$ by the condition

$$\operatorname*{Median}_{z_i \le \tilde{z}}\,(x_i - \alpha_0 - \hat{\beta} z_i) = \operatorname*{median}_{z_i > \tilde{z}}\,(x_i - \alpha_0 - \hat{\beta} z_i) \qquad (10)$$

The number of points, $m$, above the fitted line (in both groups combined)
has the binomial distribution with mean $n/2$ under the null
hypothesis.



To test $\beta = \beta_0$, one would fit a line, $x = \hat{\alpha} + \beta_0 z$, to the points
determining $\hat{\alpha}$ by

$$\hat{\alpha} = \text{median } (x_i - \beta_0 z_i) \qquad (11)$$

The points are again divided into two groups on $\tilde{z}$ and the numbers
$m_1$ and $m_2$ of points above the line in each group counted. These
with $(n/2) - m_1$ and $(n/2) - m_2$ form a $2 \times 2$ contingency table
with all margins fixed, and (unless $n$ is small) the null hypothesis may
be tested by (9), which in this case has only one degree of freedom, and,
in fact, may be put in the form

$$\chi^2 = \frac{16}{n}\left(m_1 - \frac{n}{4}\right)^2 \qquad (12)$$
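
A short sketch of these counts and criteria follows. It assumes that the two-degree-of-freedom criterion is the sum of the two squared binomial deviations from $n/4$, each divided by the binomial variance $n/8$, and that the one-degree-of-freedom form is $16(m_1 - n/4)^2/n$; the function names are assumptions made for the example.

```python
from statistics import median

def counts_above_line(x, z, alpha, beta):
    """Numbers of points above x = alpha + beta*z in the two z groups."""
    z_med = median(z)
    m1 = sum(xi > alpha + beta * zi for xi, zi in zip(x, z) if zi <= z_med)
    m2 = sum(xi > alpha + beta * zi for xi, zi in zip(x, z) if zi > z_med)
    return m1, m2

def chi2_both(m1, m2, n):
    """Criterion for alpha and beta jointly (two degrees of freedom)."""
    return 8 / n * ((m1 - n / 4) ** 2 + (m2 - n / 4) ** 2)

def chi2_slope(m1, n):
    """Criterion for beta alone (one degree of freedom)."""
    return 16 / n * (m1 - n / 4) ** 2

z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [1, 1, 1, 1, -1, -1, -1, -1]
m1, m2 = counts_above_line(x, z, alpha=0.0, beta=0.0)
```

When the margins are fixed so that $m_2 = (n/2) - m_1$, the two functions return the same value, in agreement with the remark that the first criterion then reduces to the second.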


Confidence Intervals. To obtain a confidence interval for $\alpha$, one first
fits a line $x = \hat{\beta} z$ to the data by the condition

$$\operatorname*{Median}_{z_i \le \tilde{z}}\,(x_i - \hat{\beta} z_i) = \operatorname*{median}_{z_i > \tilde{z}}\,(x_i - \hat{\beta} z_i) \qquad (13)$$

If the deviations of the $x_i$ from this line are denoted by $x'_i$, i.e., if

$$x'_i = x_i - \hat{\beta} z_i \qquad (14)$$

then the estimate of $\alpha$ is the median of the $x'_i$, and a confidence interval
may be obtained by applying the method described for $\nu$ in Sec. 3 to
the $x'_i$.

The simplest description of a confidence interval for $\beta$ is to say that
a $1 - p$ confidence interval is the set of points $\beta$ which would not be
rejected at the $p$ level of significance by the test described above.
Thus one might determine the confidence interval by trial and error.
An approximate method, which may be ordinarily expected to be
quite satisfactory, is to fit the line $x = \hat{\alpha} + \hat{\beta} z$ and rotate it about the
point where it intersects the line $z = \tilde{z}$. Since the number of points,
$m_1$, above the line and to the left of $\tilde{z}$ is approximately normally distributed
with mean $n/4$ and variance $n/16$, the limits of the confidence
interval would be obtained by rotating the line until $m_1$ reached its
$p/2$ and its $1 - (p/2)$ levels. The slopes of the line in these two
positions approximate the $1 - p$ confidence limits of $\beta$.

16.9. General Linear Regression. The treatment of the more
general case is a straightforward extension of the methods already
described. Let there be $k$ observable parameters $z_1, z_2, \ldots, z_k$, and
let the regression equation be of the form

$$y = \alpha_0 + \sum_{r=1}^{k} \alpha_r z_r \qquad (1)$$

On the basis of $n$ observations $(y_i, z_{1i}, z_{2i}, \ldots, z_{ki})$ with
$i = 1, 2, \ldots, n$
it is desired to estimate the $\alpha$'s or test hypotheses about the $\alpha$'s.

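Suppose first that the regression does not involve $\alpha_0$, so that we
merely wish to estimate $\alpha_1, \alpha_2, \ldots, \alpha_k$. The $k$ conditions on the
observations which determine these estimates are

$$\operatorname*{Median}_{z_{ri} \le \tilde{z}_r}\left(y_i - \sum_{s=1}^{k} \hat{\alpha}_s z_{si}\right) = \operatorname*{median}_{z_{ri} > \tilde{z}_r}\left(y_i - \sum_{s=1}^{k} \hat{\alpha}_s z_{si}\right) \qquad (2)$$

there being $k$ such conditions, one for each value of $r$. Thus the
observations are divided into two groups by the median of each of the
$k$ $z$'s, and the medians of the deviations in each group of any pair of
groups are required to be equal. Now turning to the case in which a
constant $\alpha_0$ is involved, the condition for determining $\hat{\alpha}_0$ is

$$\hat{\alpha}_0 = \text{median}\left(y_i - \sum_s \hat{\alpha}_s z_{si}\right) \qquad (3)$$

or, what is the same thing,

$$\text{Median}\left(y_i - \hat{\alpha}_0 - \sum_s \hat{\alpha}_s z_{si}\right) = 0 \qquad (4)$$

If we consider any one of the relations (2), it is clear that the median
on each side of the equation must in fact be the median of the whole
set of deviations, hence must be $\hat{\alpha}_0$. Thus to fit a regression function
of the form (1), one may specify the conditions (2) and (4), or he may
combine them into

$$\operatorname*{Median}_{z_{ri} \le \tilde{z}_r}\left(y_i - \hat{\alpha}_0 - \sum_s \hat{\alpha}_s z_{si}\right) = \operatorname*{median}_{z_{ri} > \tilde{z}_r}\left(y_i - \hat{\alpha}_0 - \sum_s \hat{\alpha}_s z_{si}\right) = 0 \qquad (5)$$

It is worth noting that the estimation of $\alpha_1, \alpha_2, \ldots, \alpha_k$ is entirely
independent of $\alpha_0$; one could make any assumption he wished about $\alpha_0$
without influencing the estimates of the other $\alpha$'s.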

To test hypotheses about the $\alpha$'s or estimate them by confidence
intervals, one would use the procedures described in the preceding
section. Thus to test $\alpha_1 = \alpha_{10}$, one would fit the other constants by
means of (5) with $\hat{\alpha}_1$ replaced by $\alpha_{10}$ and the relation for $r = 1$ would
of course be omitted. One would then test, using a $2 \times 2$ table,
whether the signs of the deviations were split fifty-fifty below and
above $\tilde{z}_1$. A confidence interval for $\alpha_1$ is the set of values $\alpha_{10}$ which
would not be rejected by the test. To test a set of the $\alpha$'s, for example,
to test whether the first $c$ of the $\alpha$'s had the values $\alpha_r = \alpha_{r0}$
$(r = 1, 2, \ldots, c)$,
one would fit the other $\alpha$'s using the last $k - c$ relations of (5), then
construct $c$ $2 \times 2$ tables like that used in the preceding section to test $\beta$,
adding the individual chi squares to get a criterion with $c$ degrees of
freedom. A confidence region for the $\alpha_1, \alpha_2, \ldots, \alpha_c$ may be constructed
from those points in the $(\alpha_{10}, \alpha_{20}, \ldots, \alpha_{c0})$ space which
would not be rejected by the test.

The actual fitting of a regression function requires an iterative computation.
The constant term $\alpha_0$ is estimated last by equation (3).
A first approximation $\alpha'_r$ to $\hat{\alpha}_r$ is obtained by finding the slope of the
line joining the median of $(y_i, z_{ri})$ for $z_{ri} \le \tilde{z}_r$ and the median of $(y_i, z_{ri})$
for $z_{ri} > \tilde{z}_r$. The $\alpha'_r$ are then used to compute deviations

$$y'_i = y_i - \sum_r \alpha'_r z_{ri}$$

which are again fitted to the $z_r$ in the same fashion. The slopes
obtained are added to the $\alpha'_r$ to obtain second approximations, $\alpha''_r$.
The process continues until the desired accuracy is achieved; then $\alpha_0$
is estimated as the median of the final set of deviations.

16.10. Tests of Association. Given a sample of $n$ observations from
a bivariate population, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the problem
is to test whether the two variates are independently distributed. We
assume both variates are continuous so that the probability is zero
that two observations have the same value.

Contingency Test. The simplest test that comes to mind for this 
problem is to test whether a regression line fitted to the points has zero 
slope. The test amounts merely to dividing the $n$ points into four
groups by the two lines $y = \tilde{y}$ and $x = \tilde{x}$. The numbers of points in
the four quadrants form a $2 \times 2$ contingency table and have the
contingency-table distribution under the null hypothesis. The chi-square
criterion (with one degree of freedom, since all marginal totals are fixed) 
may therefore be used unless n is small. 
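
The contingency test can be sketched as follows. The function name `quadrant_chi_square` is an assumption, the ordinary $2 \times 2$ chi-square formula is used, and points lying exactly on a median line (which occur when $n$ is odd) are simply dropped, a simplification adopted only for this example.

```python
from statistics import median

def quadrant_chi_square(x, y):
    """Quadrant counts and the 2 x 2 chi-square criterion.

    Returns (upper_left, upper_right, lower_left, lower_right) and the
    chi square, to be referred to one degree of freedom.
    """
    x_med, y_med = median(x), median(y)
    # Assumption for this sketch: drop points on either median line.
    pts = [(u, v) for u, v in zip(x, y) if u != x_med and v != y_med]
    ul = sum(u < x_med and v > y_med for u, v in pts)
    ur = sum(u > x_med and v > y_med for u, v in pts)
    ll = sum(u < x_med and v < y_med for u, v in pts)
    lr = sum(u > x_med and v < y_med for u, v in pts)
    n = len(pts)
    margins = (ul + ur) * (ll + lr) * (ul + ll) * (ur + lr)
    chi2 = n * (ul * lr - ur * ll) ** 2 / margins
    return (ul, ur, ll, lr), chi2

x = [1, 2, 3, 4, 5, 6, 7, 8]
counts, chi2 = quadrant_chi_square(x, x)  # y identical to x
```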

Corner Test. The so-called corner test appears to be the best test
yet developed for the problem at hand. There is no proof that it is
best, but in the event $x$ and $y$ are not independently distributed, this
test appears most likely to reject the null hypothesis. It is as simple
to use as the contingency test.

The test is performed as follows: First the observations are divided
into four groups by the medians as in the contingency test (the solid
lines of Fig. 74). Now we shall arrange to deal always with an even
number of points. If $n = 2m$, the two median lines will not intersect
any points. (In practice they may, because of coarse measurement.
If, for example, the horizontal line intersects two points, one may
choose one of them arbitrarily and move it slightly up or down according
as a tossed coin falls heads or tails; the other would be moved in
the opposite direction. A similar procedure would be used for four,
six, ... points on a median line.) If $n = 2m + 1$, the two median
lines will each pass through a point. In case it is the same point, that
point is omitted. In case the two points are different, they are both
omitted and a new point constructed from the two coordinates of the
omitted points which are not medians. The circled point in Fig. 74
was added to the original data in this manner. Thus, in any case, we
shall deal with an even number of points, say $2m$.

[Fig. 74.]

The dashed lines of Fig. 74 are now constructed. Starting at the
left, a vertical line is moved to the right until one encounters a point
on the opposite side of the horizontal line from the first point encountered.
The upper horizontal dashed line is moved down from above
until one encounters a point on the opposite side of $x = \tilde{x}$ from the
first point encountered. The other two dashed lines are located
similarly, the right one by moving to the left, the lower one by moving
up. Let $r_1$ denote the number of points to the left of the left line; $r_2$
denote the number of points above the upper line, etc. Points in the
upper right and lower left quadrants are counted positive while those
in the other two quadrants are counted negative. Thus in Fig. 74,
$r_1 = 1$, $r_2 = 3$, $r_3 = 4$, $r_4 = 4$.

The test criterion is 


$$r = r_1 + r_2 + r_3 + r_4 \qquad (1)$$


and it is intuitively clear that a large positive or negative value of r is 
evidence against independence while small values of r are expected 
if the null hypothesis is true. We must find the distribution of r 
under the null hypothesis in order to determine the critical level for a 
desired probability of a Type I error. 
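
The construction of the four dashed lines amounts to counting, from each side of the plot, the run of outermost points that fall on one side of the opposite median line. A minimal sketch follows; the function name `corner_statistic` is an assumption, and the sketch supposes an even number of points with no ties and none on a median line, the adjustments for odd $n$ having been made beforehand.

```python
from statistics import median

def corner_statistic(x, y):
    """Sum of the four signed corner counts r1 + r2 + r3 + r4."""
    x_med, y_med = median(x), median(y)
    pts = list(zip(x, y))

    def signed_run(ordered, side, positive_side):
        # Count points from the outside in until the side changes.
        first = side(ordered[0])
        count = 0
        for p in ordered:
            if side(p) != first:
                break
            count += 1
        return count if first == positive_side else -count

    above = lambda p: p[1] > y_med
    right = lambda p: p[0] > x_med
    by_x = sorted(pts, key=lambda p: p[0])
    by_y = sorted(pts, key=lambda p: p[1])

    r1 = signed_run(by_x, above, False)       # from the left; lower left is positive
    r2 = signed_run(by_y[::-1], right, True)  # from above; upper right is positive
    r3 = signed_run(by_x[::-1], above, True)  # from the right; upper right is positive
    r4 = signed_run(by_y, right, False)       # from below; lower left is positive
    return r1 + r2 + r3 + r4

x = [1, 2, 3, 4, 5, 6, 7, 8]
increasing = corner_statistic(x, x)
decreasing = corner_statistic(x, x[::-1])
```

A perfectly increasing relation gives the largest positive total and a perfectly decreasing one the largest negative total, in line with the intuitive remark above.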

If $x$ and $y$ are independently distributed, then a random sample of $n$
pairs $(x, y)$ is nothing more than a sample of $n$ $x$'s and a sample of $n$ $y$'s
paired at random. If the $x$'s are ordered with $x_1$ the smallest, $x_2$ the
second smallest, and so on, then the sample of $n$ $y$'s may be paired
with the $x$'s in $n!$ ways corresponding to the $n!$ permutations of the
ordered $y$ values, and under the null hypothesis all of the permutations
are equally likely. Our distribution problem therefore is simply a
matter of counting the number of permutations of the $2m$ $y$ values
which give a specified value of $r$; this number divided by $(2m)!$ is the
probability of $r$.

Let us suppose for the moment that all four of $r_1, r_2, r_3, r_4$ are positive,
and suppose also that the number of points in the upper right quadrant
is $j$; then there will be $m - j$ points in the upper left and in the lower
right quadrants and $j$ points in the lower left quadrant. The numbers
$r_2$ and $r_3$ depend only on the $m$ $x$'s greater than $\tilde{x}$ and the $m$ $y$'s greater
than $\tilde{y}$. For $r_2$ to be positive, the $j$ $y$ values in the upper right quadrant
must include the top $r_2$ $y$'s but not the one just below them. The
other $j - r_2$ $y$'s in this quadrant must therefore be selected from the
$m - r_2 - 1$ smallest of the $m$ $y$'s greater than $\tilde{y}$; this selection can be
made in $\binom{m - r_2 - 1}{j - r_2}$ ways. The $j$ $y$'s that have been selected must
now be associated with $j$ of the $x$'s to the right of $\tilde{x}$, and among these must
be the top $r_3$ $x$'s, since $r_3$ is assumed positive, but not $x_{2m - r_3}$. The
other $j - r_3$ values of $x$ may therefore be selected in $\binom{m - r_3 - 1}{j - r_3}$
ways from the smallest $m - r_3 - 1$ values of $x$ to the right of $\tilde{x}$.


A similar argument shows that there are

$$\binom{m - r_1 - 1}{j - r_1}\binom{m - r_4 - 1}{j - r_4}$$

selections of $y$'s and $x$'s which give positive values of $r_1$ and $r_4$ in the
lower left quadrant. After all these selections have been made, the
remaining $m - j$ $y$ values above $\tilde{y}$ are assigned to the remaining $m - j$
values of $x$ to the left of $\tilde{x}$, and the remaining $m - j$ $y$ values less than
$\tilde{y}$ are assigned to the remaining $x$ values to the right of $\tilde{x}$. The $y$
values in any quadrant may be permuted at will. Thus there are $j!$
permutations of $y$ values in the upper right quadrant, $(m - j)!$ permutations
of $y$ values in the upper left quadrant, and so on. The total
number of $y$ permutations which give $j$ points in the upper right
quadrant and the given values of the $r$'s is therefore

$$\left[\prod_{a=1}^{4} \binom{m - r_a - 1}{j - r_a}\right] j!\,(m - j)!\,(m - j)!\,j! \qquad (2)$$


For any other assignment of signs to the $r$'s, the argument is just
the same, and the expression (2) would be changed only in that the
lower index of the binomial coefficients would be different for negative
$r$'s. If we let $s_a$ ($a = 1, 2, 3, 4$) represent the numerical value of $r_a$,
i.e., $s_a = r_a$ if $r_a$ is positive and $s_a = -r_a$ if $r_a$ is negative, then the
binomial coefficient corresponding to $r_a$ in (2) is

$$\binom{m - s_a - 1}{j - s_a}$$

if $r_a$ is positive, and is

$$\binom{m - s_a - 1}{m - j - s_a}$$

if $r_a$ is negative. We have given enough details now that it is fairly
easy to show that the factorial-moment generating function for $r$ is


CAO) = E(t’) = P(t) 


Il 
Mm 
E 
ale 
8| 
ss, 




and, of course, the probability for a given value of $r$ is the coefficient
of $t^r$ in (3).

When $n$ is small, (3) may be used to tabulate the distribution of $r$.
The large-sample distribution of $r$ is not normal. It can be shown,
though we shall not do so here, that when $n$ becomes infinite, the
generating function (3) becomes simply


o) = (à g-erp 4 X 2-eeop-o* a) 


The limiting distribution has been tabulated, and it is found that the
5 per cent limits on $r$ are $\pm 11$ and the 1 per cent limits are $\pm 14$; i.e.,

$$P(-11 < r < 11) \cong .95 \qquad (5)$$

so that if $r$ equals or exceeds 11 in numerical value, the hypothesis of
independence is rejected at the 5 per cent level of significance.

The small-sample distribution of r has been tabulated, and it is 
found that the limiting 5 and 1 per cent levels are quite satisfactory 
if the sample size is ten or more. Thus, though the distribution 
problem is rather troublesome, the application of the test is quite 
simple. 

16.11. Power Functions. No generally accepted theory of power
functions for distribution-free tests has yet been developed, and we
shall therefore confine our discussion to a few brief remarks.

The great difficulty in obtaining a power function arises from the
fact that the functional form of the distribution is not specified. Suppose,
for example, that one wishes to test the null hypothesis that the
median $\nu$ of a population has the value $\nu = 0$. What is the power of
the test of Sec. 3 at $\nu = 1$? It is apparent that it depends entirely
on the form of the distribution. If the distribution happens to be
normal and $\sigma = 0.1$, the power will be very high at $\nu = 1$ even for
small samples, but if $\sigma = 10$, the power will be quite low for small
samples. It is thus apparent that a power function in the ordinary
sense does not exist even for a specified family of distributions; the
actual distribution is needed.

To circumvent this difficulty, it has been suggested that the power
be computed as a function of $F(\nu)$ instead of as a function of $\nu$; $F(x)$
is the cumulative distribution of the population. The null hypothesis
mentioned above takes the form $F(0) = 1/2$, and the alternatives are
$0 < F(0) < 1$ excepting $F(0) = 1/2$. Thus the null hypothesis states
that $x = 0$ is the median of the population while the alternatives state
that $x = 0$ is some other percentage point of the population. This
power function for the test of the median is then identical with the
ordinary power function of the test that $p = 1/2$ for a binomial population.
It is clear that this device is applicable to many of the tests
discussed in this chapter.
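
To illustrate the device, the sketch below computes the binomial power as a function of $p = F(0)$. It assumes, purely for the example, a two-sided test that rejects when the number of observations below zero is at most `low` or at least `high`; the function name and these cut-off parameters are not taken from the text.

```python
from math import comb

def sign_test_power(n, low, high, p):
    """Probability of rejection when the count below zero is binomial (n, p).

    The hypothesis p = 1/2 is rejected when the count k satisfies
    k <= low or k >= high.
    """
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if k <= low or k >= high)

size = sign_test_power(10, 1, 9, 0.5)    # probability of a Type I error
power = sign_test_power(10, 1, 9, 0.9)   # power when F(0) = 0.9
```

The same function serves for any population, since only the value of $F(0)$ enters.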

16.12. Notes and References. Many writers have contributed to
the development of distribution-free techniques of analysis. A large
share of the credit for these developments belongs to S. S. Wilks, who
was among the first to realize the importance and potentialities of this
field and who encouraged many of his students to work in it. A paper
which gives a comprehensive survey of all the developments up to the
date of its publication and a rather complete bibliography is S. S.
Wilks, “Order statistics,” Bulletin of the American Mathematical
Society, Vol. 54 (1948), pp. 6-50.


16.13 Problems 


1. Find the density function for u = F(x_r), where x_r is the rth
ordered observation of a sample of size n from a population with
cumulative distribution F(x).

2. Derive the density function given in equation (2.10) by integrat- 
ing (2.4). 

3. Derive (2.10) by a geometrical argument, considering the x axis
divided into five intervals as illustrated. The sample is regarded as
coming from a multinomial population with five categories having

[Figure: the x axis divided into five intervals by a small interval Δy about y and a small interval Δz about z.]

probabilities F(y − Δy/2), f(y)Δy, F(z − Δz/2) − F(y + Δy/2), f(z)Δz,
1 − F(z + Δz/2), and in such a way that r − 1 observations fall in
the first category, one in the second, and so on. The density of x is
f(x) with cumulative F(x).

4. Use the geometrical method of Prob. 3 to find the joint density
function of u, the area between x_q and x_r, and v, the area between
x_s and x_t, with q < r < s < t.

5. Show that the expected value of the larger of a sample of two
observations from a normal population with zero mean and unit
variance is 1/√π, and hence that for the general normal population
the expected value is μ + (σ/√π).
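A numerical check is no substitute for the derivation, but it can guard against algebraic slips. The sketch below estimates the expected value of the larger of two normal observations by simulation and compares it with μ + σ/√π; the function name `mean_of_larger`, the number of pairs, and the seed are arbitrary choices made for illustration.

```python
import math
import random


def mean_of_larger(n_pairs, mu=0.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of E[max(x1, x2)] for two independent normal draws."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(n_pairs):
        total += max(rng.gauss(mu, sigma), rng.gauss(mu, sigma))
    return total / n_pairs


if __name__ == "__main__":
    estimate = mean_of_larger(200_000)
    print("simulated:", round(estimate, 4))
    print("1/sqrt(pi):", round(1 / math.sqrt(math.pi), 4))
```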

6. If (x, y) is an observation from a bivariate normal population
with zero means, unit variances, and correlation ρ, show that the
expected value of the larger of x and y is √((1 − ρ)/π).


7. Derive equation (4.7). 
8. Verify equations (4.9) and (4.10). 
9. Show that ¢ defined by equation (4.14) is approximately normally
distributed for large n.

10. Verify equations (4.23) and (4.24). 

11. Verify equation (4.31). 

12. Derive the distribution given in (5.1). 

13. Verify (6.9) and (6.10), and show that (6.11) and (6.12) do in
fact define the inverse matrix.

14. Provide the details of the argument for normality which uses 
(6.13). 

15. Verify (6.15). 

16. Show that (7.5) is a generating function for the factorial moments
of the n_i.

17. Verify equations (7.6) through (7.10). 

18. Show that the distribution of r of Sec. 10 is symmetric about 
r = 0, hence that E(r) = 0. 

19. Show that the limiting variance of r is 24. 

20. Check the statement at the end of Sec. 10 by tabulating the
cumulative distribution of the numerical value of r for n = 10. If s
is the numerical value, it is found that P(s > 10) = .0642,
P(s > 11) = .0436, P(s > 14) = .0127, P(s > 15) = .0095. The
corresponding values for n infinite are .0533, .0342, .0082, .0050.

21. Complete the derivation of (10.3). 

22. If x_1, x_2, ..., x_n is an ordered sample from a population with
cumulative distribution F(x), find the density for


_ Gx) — Fæ) 
[F(@n) — F(1)] 

23. The active life x, in hours, of radioactive atoms has the density
(1/θ)e^{−x/θ}. To estimate θ for a particular kind of atom, a sample of n
atoms is put under observation, but the experiment is to stop when the
rth atom has expired; i.e., it is intended not to wait until all the atoms
have ceased activity, but only until r of them (r chosen in advance)
have. The data consist then of r measurements x_1, x_2, ..., x_r and
n − r measurements known only to exceed x_r. Find the maximum-likelihood
estimate of θ, and show that it has a chi-square distribution.
Note that the likelihood contains the factor [1 − F(x_r)]^{n−r} where F(x)
is the cumulative distribution.



24. Referring to Prob. 23, must one start with newly activated 
atoms, or is it all right to start with atoms that have already been 
active for various lengths of time (and are still active)? 

25. If x is uniformly distributed between θ − 1/2 and θ + 1/2, find
the density for the median x̃ for samples of size 2k + 1.

26. Referring to Prob. 25, find the density for z = (x_1 + x_{2k+1})/2.
Is z or x̃ the better estimator of θ?

27. Show that the sample median is a consistent estimator of the
population median.

28. We have seen that the sample mean for a distribution with
infinite variance (like the Cauchy distribution) does not necessarily
converge in any sense toward the center of the distribution as the sample
size increases. Does the sample median converge to the population
median in such cases?

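29. If a population has density function

$$f(x) = \begin{cases} \tfrac{1}{2}\, e^{-(x-\theta)}, & x \ge \theta \\ \tfrac{1}{2}\, e^{(x-\theta)}, & x \le \theta \end{cases}$$

find the maximum-likelihood estimate of θ for samples of size n.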

30. A common measure of association for two variates x and y is the 
rank correlation, or Spearman’s correlation. The x values are ranked 
and the observations replaced by their ranks; similarly the y observa- 
tions are replaced by their ranks. Thus for samples of size n one 


might have: 


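Using these paired ranks, the ordinary correlation is computed

$$S = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} = 1 - \frac{6 \sum d_i^2}{n^3 - n}$$

where the capital letters represent the ranks, and d_i = X_i − Y_i.
Verify that the given relation is true.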
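Before attempting the algebra, the relation between the product-moment form and the form in the rank differences can be checked numerically. The sketch below does so for a small set of made-up observations; the data values and the function names are hypothetical, and ties are assumed absent.

```python
from math import sqrt


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def rank_correlation_product_moment(X, Y):
    """Ordinary correlation coefficient computed on the ranks X, Y."""
    n = len(X)
    xb, yb = sum(X) / n, sum(Y) / n
    num = sum((a - xb) * (b - yb) for a, b in zip(X, Y))
    den = sqrt(sum((a - xb) ** 2 for a in X) * sum((b - yb) ** 2 for b in Y))
    return num / den


def rank_correlation_from_differences(X, Y):
    """1 - 6 * sum(d_i^2) / (n^3 - n), with d_i = X_i - Y_i."""
    n = len(X)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(X, Y)) / (n**3 - n)


if __name__ == "__main__":
    x = [2.1, 0.4, 3.3, 1.7, 5.0]  # made-up observations
    y = [1.0, 0.2, 2.5, 3.1, 4.4]  # made-up observations
    X, Y = ranks(x), ranks(y)
    print(X, Y)
    print(rank_correlation_product_moment(X, Y))
    print(rank_correlation_from_differences(X, Y))
```

The two printed coefficients agree, as the relation requires.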


31. Show that the distribution of S of Prob. 30 is independent of the
form of the distributions of x and y, provided that they are independently
distributed, hence that S is a distribution-free criterion for
testing the null hypothesis of no association.

32. Show that the mean and variance of S under the hypothesis of
independence are zero and 1/(n − 1). To do this, show that S may
be put in the form

$$S = \frac{12}{n^3 - n}\left[Q - \frac{n(n+1)^2}{4}\right]$$

where Q = Σ iY_i (replacing X_i by i), and observe that the coefficient of

$$\prod_{i=1}^{n} u_i$$

in

$$\phi(t) = \frac{1}{n!} \prod_{j=1}^{n} \left( \sum_{i=1}^{n} u_i t^{ij} \right)$$

is a factorial-moment generating function for Q.
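For small n the stated mean and variance can be confirmed by brute force, since under independence every permutation of the Y ranks is equally likely. The following sketch enumerates all permutations and uses exact rational arithmetic; it assumes the form S = 12[Q − n(n + 1)²/4]/(n³ − n) with Q = Σ iY_i, and the function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations


def rank_correlation(Y):
    """S = 12/(n^3 - n) * (Q - n(n+1)^2/4), with Q = sum of i*Y_i (X_i replaced by i)."""
    n = len(Y)
    Q = sum(i * y for i, y in enumerate(Y, start=1))
    return Fraction(12, n**3 - n) * (Q - Fraction(n * (n + 1) ** 2, 4))


def mean_and_variance(n):
    """Exact mean and variance of S over all n! equally likely rank permutations."""
    values = [rank_correlation(p) for p in permutations(range(1, n + 1))]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(n, mean_and_variance(n))
```

The enumeration grows as n!, so it is practical only for quite small samples; the generating-function argument is what handles general n.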


33. Apply some of the distribution-free methods to sets of data to 
be found in problems of Chaps. 13 and 14. 




DESCRIPTION OF TABLES 


I. Ordinates of the Normal Density Function. This table gives values
of

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

for values of x between zero and four at intervals of 0.01. Of course
one uses the fact that f(−x) = f(x) for negative values of x.
II. Cumulative Normal Distribution. This tabulates

$$F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

for values of x between zero and 3.5 at intervals of 0.01. For negative
values of x, one uses the relation F(−x) = 1 − F(x). Values of x
corresponding to a few round values of F are given separately beneath
the main table.
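Where a computer is at hand, entries of this kind can be spot-checked directly. The sketch below uses the `NormalDist` class of the Python standard library (a modern convenience, not part of the original tables) to evaluate F(x), to confirm the relation F(−x) = 1 − F(x), and to recover two of the round-value points.

```python
from statistics import NormalDist

std = NormalDist()  # normal distribution with zero mean and unit variance


def F(x):
    """Cumulative normal distribution F(x)."""
    return std.cdf(x)


if __name__ == "__main__":
    print(round(F(1.0), 4))             # a tabled entry for x = 1.00
    print(F(-1.0), 1 - F(1.0))          # symmetry relation for negative x
    print(round(std.inv_cdf(0.95), 3))  # x corresponding to F = .95
    print(round(std.inv_cdf(0.975), 3)) # x corresponding to F = .975
```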

III. Cumulative Chi-square Distribution. This table gives values
of u corresponding to a few selected values of F(u) where

$$F(u) = \int_0^u \frac{x^{(n-2)/2}\, e^{-x/2}}{2^{n/2}\, [(n-2)/2]!}\, dx$$

for n, the number of degrees of freedom, equal to 1, 2, ..., 30. For
larger values of n, a normal approximation is quite accurate. The
quantity √(2u) − √(2n − 1) is nearly normally distributed with zero
mean and unit variance. Thus u_α, the α point of the distribution,
may be computed by

$$u_\alpha = \tfrac{1}{2}\left(x_\alpha + \sqrt{2n - 1}\right)^2$$

where x_α is the α point of the cumulative normal distribution. As an
illustration, we may compute the .95 value of u for n = 30 degrees of
freedom:

$$u_{.95} = \tfrac{1}{2}\,(1.645 + \sqrt{59})^2 = 43.5$$

which is in error by less than 1 per cent.
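The arithmetic of this approximation is easily mechanized. The following sketch evaluates u_α = (x_α + √(2n − 1))²/2, obtaining x_α from the standard-library `NormalDist` class; the function name `chi_square_point_approx` is an illustrative choice, and the comparison value 43.773 is the tabled .95 point for 30 degrees of freedom as given in standard chi-square tables.

```python
from math import sqrt
from statistics import NormalDist


def chi_square_point_approx(alpha, n):
    """Approximate alpha point of chi-square with n degrees of freedom.

    Uses u_alpha ~ (x_alpha + sqrt(2n - 1))**2 / 2, where x_alpha is the
    alpha point of the cumulative normal distribution.
    """
    x_alpha = NormalDist().inv_cdf(alpha)
    return 0.5 * (x_alpha + sqrt(2 * n - 1)) ** 2


if __name__ == "__main__":
    approx = chi_square_point_approx(0.95, 30)
    tabled = 43.773  # standard tabled value, quoted for comparison
    print(round(approx, 1))
    print(f"relative error: {abs(approx - tabled) / tabled:.4f}")
```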


IV. Cumulative “Student’s” Distribution. This table gives values
of t corresponding to a few selected values of

$$F(t) = \int_{-\infty}^{t} \frac{[(n-1)/2]!}{\sqrt{\pi n}\, [(n-2)/2]!} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} dx$$

with n = 1, 2, ..., 30, 40, 60, 120, ∞. Since the density is symmetric
in t, it follows that F(−t) = 1 − F(t). One should not interpolate
linearly between degrees of freedom but on the reciprocal of the
degrees of freedom, if good accuracy in the last digit is desired. As an
illustration, we shall compute the .975 value for 40 degrees of freedom.
The values for 30 and 60 are 2.042 and 2.000. Using the reciprocals
of n, the interpolated value is

$$2.042 - \frac{\frac{1}{30} - \frac{1}{40}}{\frac{1}{30} - \frac{1}{60}}\,(2.042 - 2.000) = 2.021$$

which is the correct value. Interpolating linearly, one would have
obtained 2.028.
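The two interpolation rules can be set side by side in a few lines. The sketch below reproduces the illustration for 40 degrees of freedom from the tabled values at 30 and 60; the function names are illustrative only.

```python
def interpolate_on_reciprocal(n, n_lo, t_lo, n_hi, t_hi):
    """Interpolate a tabled t value linearly in 1/n between n_lo and n_hi."""
    w = (1 / n_lo - 1 / n) / (1 / n_lo - 1 / n_hi)
    return t_lo - w * (t_lo - t_hi)


def interpolate_linear(n, n_lo, t_lo, n_hi, t_hi):
    """Interpolate a tabled t value linearly in n (the less accurate rule)."""
    w = (n - n_lo) / (n_hi - n_lo)
    return t_lo - w * (t_lo - t_hi)


if __name__ == "__main__":
    # .975 values for 30 and 60 degrees of freedom, from Table IV
    print(round(interpolate_on_reciprocal(40, 30, 2.042, 60, 2.000), 3))
    print(round(interpolate_linear(40, 30, 2.042, 60, 2.000), 3))
```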
V. Cumulative F Distribution. This table gives values of F corresponding
to five values of

$$G(F) = \int_0^F \frac{[(m+n-2)/2]!}{[(m-2)/2]!\,[(n-2)/2]!}\; m^{m/2}\, n^{n/2}\, x^{(m-2)/2}\, (n + mx)^{-(m+n)/2}\, dx$$

for selected values of m and n; m is the number of degrees of freedom
in the numerator of F, and n is the number of degrees of freedom in the
denominator of F. The table also provides values corresponding to
G = .10, .05, .025, .01, and .005 because F_{1−α} for m and n degrees of
freedom is the reciprocal of F_α for n and m degrees of freedom. Thus
for G = .05 with 3 and 6 degrees, one finds

$$F_{.05}(3, 6) = \frac{1}{F_{.95}(6, 3)} = \frac{1}{8.94} = .112$$

One should interpolate on the reciprocals of m and n as in Table IV
for good accuracy.


TABLE I. ORDINATES OF THE NORMAL DENSITY FUNCTION

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

  x      .09

 0.0    .3973
 0.1    .3918
 0.2    .3825
 0.3    .3697
 0.4    .3538

 0.5    .3352
 0.6    .3144
 0.7    .2920
 0.8    .2685
 0.9    .2444

 1.0    .2203
 1.1    .1965
 1.2    .1736
 1.3    .1518
 1.4    .1315

 1.5    .1127
 1.6    .0957
 1.7    .0804
 1.8    .0669
 1.9    .0551

 2.0    .0449
 2.1    .0363
 2.2    .0290
 2.3    .0229
 2.4    .0180

 2.5    .0139
 2.6    .0107
 2.7    .0081
 2.8    .0061
 2.9    .0046

 3.0    .0034
 3.1    .0025
 3.2    .0018
 3.3    .0013
 3.4    .0009

 3.5    .0006
 3.6    .0004
 3.7    .0003
 3.8    .0002
 3.9    .0001


TABLE II. CUMULATIVE NORMAL DISTRIBUTION

$$F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

  x     .00     .01     .02     .03     .04     .05     .07     .08     .09

 0.0   .5000   .5040   .5080   .5120   .5160   .5199   .5279   .5319   .5359
 0.1   .5398   .5438   .5478   .5517   .5557   .5596   .5675   .5714   .5753
 0.2   .5793   .5832   .5871   .5910   .5948   .5987   .6064   .6103   .6141
 0.3   .6179   .6217   .6255   .6293   .6331   .6368   .6443   .6480   .6517
 0.4   .6554   .6591   .6628   .6664   .6700   .6736   .6808   .6844   .6879

 0.5   .6915   .6950   .6985   .7019   .7054   .7088   .7157   .7190   .7224
 0.6   .7257   .7291   .7324   .7357   .7389   .7422   .7486   .7517   .7549
 0.7   .7580   .7611   .7642   .7673   .7704   .7734   .7794   .7823   .7852
 0.8   .7881   .7910   .7939   .7967   .7995   .8023   .8078   .8106   .8133
 0.9   .8159   .8186   .8212   .8238   .8264   .8289   .8340   .8365   .8389

 1.0   .8413   .8438   .8461   .8485   .8508   .8531   .8577   .8599   .8621
 1.1   .8643   .8665   .8686   .8708   .8729   .8749   .8790   .8810   .8830
 1.2   .8849   .8869   .8888   .8907   .8925   .8944   .8980   .8997   .9015
 1.3   .9032   .9049   .9066   .9082   .9099   .9115   .9147   .9162   .9177
 1.4   .9192   .9207   .9222   .9236   .9251   .9265   .9292   .9306   .9319

 1.5   .9332   .9345   .9357   .9370   .9382   .9394   .9418   .9429   .9441
 1.6   .9452   .9463   .9474   .9484   .9495   .9505   .9525   .9535   .9545
 1.7   .9554   .9564   .9573   .9582   .9591   .9599   .9616   .9625   .9633
 1.8   .9641   .9649   .9656   .9664   .9671   .9678   .9693   .9699   .9706
 1.9   .9713   .9719   .9726   .9732   .9738   .9744   .9756   .9761   .9767

 2.0   .9772   .9778   .9783   .9788   .9793   .9798   .9808   .9812   .9817
 2.1   .9821   .9826   .9830   .9834   .9838   .9842   .9850   .9854   .9857
 2.2   .9861   .9864   .9868   .9871   .9875   .9878   .9884   .9887   .9890
 2.3   .9893   .9896   .9898   .9901   .9904   .9906   .9911   .9913   .9916
 2.4   .9918   .9920   .9922   .9925   .9927   .9929   .9932   .9934   .9936

 2.5   .9938   .9940   .9941   .9943   .9945   .9946   .9949   .9951   .9952
 2.6   .9953   .9955   .9956   .9957   .9959   .9960   .9962   .9963   .9964
 2.7   .9965   .9966   .9967   .9968   .9969   .9970   .9972   .9973   .9974
 2.8   .9974   .9975   .9976   .9977   .9977   .9978   .9979   .9980   .9981
 2.9   .9981   .9982   .9982   .9983   .9984   .9984   .9985   .9986   .9986

 3.0   .9987   .9987   .9987   .9988   .9988   .9989   .9989   .9990   .9990
 3.1   .9990   .9991   .9991   .9991   .9992   .9992   .9992   .9993   .9993
 3.2   .9993   .9993   .9994   .9994   .9994   .9994   .9995   .9995   .9995
 3.3   .9995   .9995   .9995   .9996   .9996   .9996   .9996   .9996   .9997
 3.4   .9997   .9997   .9997   .9997   .9997   .9997   .9997   .9997   .9998

 x             1.282   1.645   1.960   2.326   2.576   3.090   3.291   3.891    4.417
 F(x)           .90     .95     .975    .99     .995    .999    .9995   .99995   .999995
 2[1 − F(x)]    .20     .10     .05     .02     .01     .002    .001    .0001    .00001

TABLE III. CUMULATIVE CHI-SQUARE DISTRIBUTION*

$$F(u) = \int_0^u \frac{x^{(n-2)/2}\, e^{-x/2}}{2^{n/2}\, [(n-2)/2]!}\, dx$$

* This table is abridged from “Tables of percentage points of the incomplete beta
function and of the chi-square distribution,” Biometrika, Vol. 32 (1941). It is here
published with the kind permission of the author, Catherine M. Thompson, and the
editor of Biometrika.
TABLE IV. CUMULATIVE "STUDENT'S" DISTRIBUTION*

$$F(t) = \int_{-\infty}^{t} \frac{[(n-1)/2]!}{\sqrt{\pi n}\, [(n-2)/2]!} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} dx$$

  n \ F    .75     .90     .95     .975     .99      .995     .9995

   1     1.000   3.078   6.314   12.706   31.821   63.657   636.619
   2      .816   1.886   2.920    4.303    6.965    9.925    31.598
   3      .765   1.638   2.353    3.182    4.541    5.841    12.941
   4      .741   1.533   2.132    2.776    3.747    4.604     8.610
   5      .727   1.476   2.015    2.571    3.365    4.032     6.859

   6      .718   1.440   1.943    2.447    3.143    3.707     5.959
   7      .711   1.415   1.895    2.365    2.998    3.499     5.405
   8      .706   1.397   1.860    2.306    2.896    3.355     5.041
   9      .703   1.383   1.833    2.262    2.821    3.250     4.781
  10      .700   1.372   1.812    2.228    2.764    3.169     4.587

  11      .697   1.363   1.796    2.201    2.718    3.106     4.437
  12      .695   1.356   1.782    2.179    2.681    3.055     4.318
  13      .694   1.350   1.771    2.160    2.650    3.012     4.221
  14      .692   1.345   1.761    2.145    2.624    2.977     4.140
  15      .691   1.341   1.753    2.131    2.602    2.947     4.073

  16      .690   1.337   1.746    2.120    2.583    2.921     4.015
  17      .689   1.333   1.740    2.110    2.567    2.898     3.965
  18      .688   1.330   1.734    2.101    2.552    2.878     3.922
  19      .688   1.328   1.729    2.093    2.539    2.861     3.883
  20      .687   1.325   1.725    2.086    2.528    2.845     3.850

  21      .686   1.323   1.721    2.080    2.518    2.831     3.819
  22      .686   1.321   1.717    2.074    2.508    2.819     3.792
  23      .685   1.319   1.714    2.069    2.500    2.807     3.767
  24      .685   1.318   1.711    2.064    2.492    2.797     3.745
  25      .684   1.316   1.708    2.060    2.485    2.787     3.725

  26      .684   1.315   1.706    2.056    2.479    2.779     3.707
  27      .684   1.314   1.703    2.052    2.473    2.771     3.690
  28      .683   1.313   1.701    2.048    2.467    2.763     3.674
  29      .683   1.311   1.699    2.045    2.462    2.756     3.659
  30      .683   1.310   1.697    2.042    2.457    2.750     3.646

  40      .681   1.303   1.684    2.021    2.423    2.704     3.551
  60      .679   1.296   1.671    2.000    2.390    2.660     3.460
 120      .677   1.289   1.658    1.980    2.358    2.617     3.373
   ∞      .674   1.282   1.645    1.960    2.326    2.576     3.291

* This table is abridged from the "Statistical Tables" of R. A. Fisher and Frank Yates published
by Oliver & Boyd, Ltd., Edinburgh and London, 1938. It is here published with the kind permission
of the authors and their publishers.


TABLE V. CUMULATIVE F DISTRIBUTION*

m degrees of freedom in numerator; n in denominator

$$G(F) = \int_0^F \frac{[(m+n-2)/2]!}{[(m-2)/2]!\,[(n-2)/2]!}\; m^{m/2}\, n^{n/2}\, x^{(m-2)/2}\, (n + mx)^{-(m+n)/2}\, dx$$

* This table is abridged from “Tables of percentage points of the inverted beta distribution,”
Biometrika, Vol. 33 (1943). It is here published with the kind permission of the authors,
Maxine Merrington and Catherine M. Thompson, and the editor of Biometrika.


INDEX 


A 


Analysis, of covariance, 350, 363 
adjusted means, 356 
of variance, 318 
(See also Components of variance 
and Distribution-free tests) 
Greco-Latin squares, 341 
Latin squares, 339 
in linear regression, 318 
mixed models, 348 
one-factor experiments, 323, 364 
randomized blocks, 329 
three-factor experiments, 337, 346 
two-factor experiments, 329, 334, 
342, 345 
Average outgoing quality, 384 
Average sample size in sequential tests, 
372, 379 


B 


Beta distribution, 115 
Bias, 132, 149, 255 
Binomial distribution, 54 
confidence limits for p, 233 
cumulative form, 235 
normal approximation, 139 
Bivariate normal distribution, 165 
moment generating function for, 166 


C


Cauchy distribution, 117, 216 
Central limit theorem, 136 
Chi-square distribution, 199 
Chi-square tests, 271 
contingency tables, 276, 280, 281 
distribution-free methods, 395, 398, 
402, 405 
goodness-of-fit, 270 
Combinations, 10, 12 


Combinatorial generating functions, 19 
Components of variance, 342 

mixed model, 348 

one-factor experiments, 364 

three-factor experiments, 346 

two-factor experiments, 342, 345 
Conditional distributions, 50, 52, 83 

bivariate normal, 168 

continuous, 83 

discrete, 50, 52 

multivariate normal, 181 
Conditional probability, 23, 26, 32, 50 
Confidence intervals, 220, 222 

difference between means, 267 

general method for, 229 

large sample, 235 

mean of a normal population, 224 

p of binomial population, 233 

range of rectangular population, 241 

regression coefficients, 295, 304 

variance of normal population, 226 

variance ratio, 243 

(See also Distribution-free confidence 

intervals) 

Confidence regions, 223 

large sample, 237 

for mean and variance, 227 

for regression coefficients, 206 
Consistency of an estimate, 149 
Contingency tables, 273 

tests for independence in, 276, 281, 

287, 288 

Continuous distributions, 65, 68 
Control chart, 362 
Correlation, 103, 167, 189 

distribution of estimator, 314 

multiple, 191, 314 

partial, 190 

Spearman's rank, 417 
Covariance, 103, 167, 189 

analysis of, 350, 363 


Critical region, 247 


Cumulants, 105, 123 
Cumulative distributions, 76, 81 
Curve, regression, 169 
operating-characteristic of, 376 
Curve fitting, method of least squares 
for, 309 
method of moments vs. maximum 
likelihood, 161 


D 


Degrees of freedom, 200, 205, 206 
Density functions, 44, 46, 81 
Difference between means, confidence 
limits for, 267 
distribution of, 267 
tests of, 263 
Discrete distributions, 44, 47 
Discrimination, problem of, 299 
Distribution-free confidence intervals, 
difference between medians, 395 
median of, 388 
percentage points, 389 
regression coefficients, 408 
Distribution-free methods, 385 
estimate of medians, 388 
estimate of percentage points, 388 
estimate of regression coefficients, 
406, 409 
for factorial experiments, 398, 399, 
402 
general linear regression, 408 
large sample, 389, 393, 396, 414
simple regression, 406 
Distribution-free tests, association, 410, 
417 
corner test, 410 
equality of distributions, 391, 394 
equality of medians, 394 
interaction, 405 
median, 390 
one-factor experiments, 398 
percentage points, 390 
regression coefficients, 407, 409 
run test, 391 
two-factor experiments, 399, 402 
Distributions, 44, 47, 81 
beta, 115 
binomial, 54 
bivariate normal, 165 


Cauchy, 117
chi-square, 199 
continuous, 65, 68 
cumulative, 76, 81 
discrete, 44, 47, 50, 52 
F, 204 
gamma, 112 
Gram-Charlier, 118 
hypergeometric, 61 
linear function of normal variates, 218

multinomial, 58 
multivariate, 47, 74 
multivariate normal, 177 
normal, 108 
Pearson, 118 
Poisson, 59 
sample, 128 
“Student’s,” 206 
t, 206 
uniform, 107 
variance ratio, 204 

(See also Sampling distributions) 


E 


Error, Type I, 246
Type II, 247

Estimation, of parameters, 147 
efficiency of, 149, 150 
maximum likelihood, 154 
method of moments, 161 
unbiased, 149 

Expected values, 91 

Experiments, design of, 1, 316 


F 


F distribution, 204 
Factorial moments, 100 
Fiducial probability, 222 
Finite populations, sampling from, 130, 
146 
Forms, quadratic, 177 
Functions, density, 44, 81 
distribution, 81 
likelihood, 154 
moment generating, 100 
power, 248, 369 
regression, 190, 291 




G 


Gamma distribution, 112 
Goodness-of-fit test, 270 

Gram-Charlier series, 118 
Greco-Latin squares, 341 


H 


Hermite polynomials, 119 
Homogeneity of variances, test of, 269 
Homoscedasticity, 324 
Hypotheses, composite, 256 

linear, 305 

null, 245 

simple, 256 

(See also Tests of hypotheses) 


I 


Independence, functional, 49, 50 

in contingency tables, 273 

in probability sense, 34, 50, 85 

of sample mean and variance, 201 
Inspection, sampling, 375 
Interaction, in analysis of variance, 

335, 339, 343, 349, 405 
in contingency tables, 275 


J 


Joint distribution, 75 
Joint moments, 102 


L 


Large samples, 136 
confidence limits from, 235 
confidence regions from, 237 
distribution of estimators, 208 
of likelihood ratio, 259 
of mean, 136 
Latin squares, 339 
Law of large numbers, 133 
Least squares, 309 
Likelihood-ratio tests, 257 
large-sample distribution for, 259 
Linear functions of normal variates, dis- 
tribution of, 218 
Linear regression, 291


M 


Marginal distributions, 50, 82 
continuous, 82 
discrete, 50 
Marginal probability, 23, 24 
Matrices, 170 
algebra of, 171 
inverse of, 172, 175 
variance-covariance, 176 
Maximum likelihood, principle of, 152, 
153 
Maximum-likelihood estimators, 152, 
154 
large-sample distribution of, 208 
properties of, 158 
Mean, confidence limits for, 224 
distribution of, 136, 259 
population, 93 
sample, 130 
tests of, 259, 263 
Median, 94, 387 
Mendelian inheritance, 41, 42, 286, 287 
Moment generating function, 100 
for chi-square distribution, 200 
factorial, 102 
for gamma distribution, 115 
for normal distribution, 112, 166, 184 
for Poisson distribution, 101 
for several variates, 103 
Moment problem, 103 
Moments, 93 
estimators of, 132, 160 
factorial, 100
joint, 102 
population, 93 
sample, 130 
Multinomial distribution, 58 
Multiple correlation, 191 
Multivariate distributions, 47, 74 
Multivariate normal distribution, 177 
estimators for parameters in, 186 
marginal and conditional distribu- 
tions for, 181 
moment generating function for, 184 


N 


Nonparametric methods, 385
(See also Distribution-free methods) 


Normal distribution, 108
bivariate, 165
conditional forms, 168, 181 
distribution, of sample mean, 198 

of sample variance, 204 

independence of sample mean and 
variance, 201 

marginal forms, 168, 181 

moment generating function for, 112, 
166, 184 

multivariate, 177 

regression functions for, 169, 184 

role of, 142 

Null hypothesis, 245 


O


Operating-characteristic curve, 376
Order statistics, 385

Orthogonal polynomials, 313 
Orthogonal tests, 321 


P


Parameter space, 255 
Parameters, 55 
Partial correlation, 190 
Partitions, of numbers, 19 

of sums of squares, 319, 324, 331, 335 
Pearson's chi-square tests, 271, 280 
Pearson's curves, 118 
Permutations, 10, 11 
Poisson distribution, 59 
Populations, 126 
Power, of the test, 248 

function of a test, 253, 369 
Prediction interval, 297, 304 
Principle of maximum likelihood, 152, 

153 

Probability, 8 

conditional, 23, 26, 32 

empirical, 36 

fiducial, 222 

laws of, 27 

marginal, 23, 24 
Probability density function, 44, 81 


Q 


Quadratic forms, 177 
Quality control, 361, 362 


R 


Random sampling, 126, 128 
Randomization, 317 
Randomized blocks, 329 
Range, interquartile, 387 
Regression, 289, 406, 408 
coefficient, 295 
curve, 169 
function, 190, 291 
linear, 291, 408 
multiple, 301 
normal, 291, 307 
variance about, 190 


Runs, 391


S

Sample, 126 
distributions, 128, 192 
mean, 130 


moments, 130 
random, 126, 128
Sampling distributions for, difference of 
two means, 218, 266 
likelihood ratio, 259 
maximum likelihood estimators, 212 
mean of large samples, 136 
of samples from binomial popula- 
tion, 206 
of samples from normal population, 
198 
of samples from Poisson popula- 
tion, 206 
order statistics, 386
ratio of sample variances, 204 
regression coefficients, 292, 302 
sum of squares, 199 
variance of a sample, 203 
Sampling inspection, 375 
double, 377 
sequential, 377 
single, 375 
Sequential tests, 366 
for binomial, 378 
fundamental identity for, 384 
for mean of normal population, 374,
380, 383
power functions for, 369, 383 
sample size in, 372 




Significance level, 247 

Standard deviation, 95 

Statistical inference, 3, 124 

Statistical tests (see Tests of hypotheses)

Stirling’s formula, 16

"Student's" t distribution, 206, 218

Sufficient estimators, 151

Sum of squares, distribution of, 199 

partition of, 319, 324, 331, 335 


T


t distribution, 206, 217, 218
Tchebysheff’s inequality, 135
Test, unbiased, 255
uniformly most powerful, 253
Tests of hypotheses, 245
additivity of means, 335, 345 
distribution-free (see Distribution- 
free tests) 
equality-of-means, 263 
goodness-of-fit, 270 
homogeneity of variances, 268, 269 
independence in contingency tables, 
273 
large-sample, 257 
likelihood-ratio, 257 
linearity, 321 
mean of normal population, 259 
null hypotheses, 245 
one-sided, 262 
ratio of variances, 268 
sequential, 366
two-sided, 262
variance of normal population, 267 
(See also Distribution-free methods) 
Three-factor experiments, 337, 339 
analysis of variance, 337 
components of variance, 346 
Transformations, 107, 192 
Truncated normal distribution, 243 
Two-factor experiments, 329 
analysis, of covariance, 350 
of variance, 334 
components of variance, 342 
distribution-free analysis, 399, 402 
Type I and II errors, 246 


U 


Unbiased estimators, 149 
Unbiased test, 255 

Uniform distribution, 107 
Uniformly most powerful test, 253 


V 


Variance, 94 

analysis of, 318 

distribution of sample, 203 

estimate of, 156 

of linear function, 189 

about regression function, 190 

of sample mean, 133 

test of homogeneity of several vari- 

ances, 269 

Variance-covariance matrix, 176 
Variate, 46, 65 




